Introducing six new cryptocurrencies in BigQuery Public Datasets—and how to analyze them

Since they emerged in 2009, cryptocurrencies have experienced their share of volatility—and are a continual source of fascination. In the past year, as part of the BigQuery Public Datasets program, Google Cloud released datasets consisting of the blockchain transaction history for Bitcoin and Ethereum, to help you better understand cryptocurrency. Today, we're releasing an additional six cryptocurrency blockchains. We are also including a set of queries and views that map all blockchain datasets to a double-entry book data structure that enables multi-chain meta-analyses, as well as integration with conventional financial record processing systems.

Additional blockchain datasets

The six cryptocurrency blockchain datasets we're releasing today are Bitcoin Cash, Dash, Dogecoin, Ethereum Classic, Litecoin, and Zcash. Five of these datasets, along with the previously published Bitcoin dataset, now follow a common schema that enables comparative analyses. We are releasing this group of Bitcoin-like datasets (Bitcoin, Bitcoin Cash, Dash, Dogecoin, Litecoin, and Zcash) together because they all have similar implementations, i.e., their source code is derived from Bitcoin's. Similarly, we're releasing the Ethereum Classic dataset alongside the previously published Ethereum dataset; Ethereum Classic uses the same common schema.

A unified data ingest architecture

All datasets update every 24 hours via a common codebase, the Blockchain ETL ingestion framework (built with Cloud Composer, previously described here), to accommodate a variety of Bitcoin-like cryptocurrencies. While this means higher latency for loading Bitcoin blocks into BigQuery, it also means that:

- We are able to ingest additional BigQuery datasets with less effort, meaning additional datasets can be onboarded more quickly in the future.
- We can implement a low-latency loading solution once that can be used to enable real-time streaming transactions for all blockchains.

Unified schema and views

Since we provided the original Bitcoin dataset last year, we've learned how users want to access data, and restructured the dataset accordingly. Some of these changes address performance and convenience concerns, yielding faster and lower-cost queries (commonly accessed nested data are denormalized; each table is partitioned by time).

We've also included more data, such as script op-codes. Most Bitcoin transactions describe transfers of value not simply as a debit/credit pair, but rather as a series of functions that describe both simple transfers and more complex transactions. Having these scripts available for Bitcoin-like datasets enables more advanced analyses, similar to the smart contract analyzer that Tomasz Kolinko recently built on top of the BigQuery Ethereum dataset. For example, we can now identify and report on patterns of activity involving multi-signature wallets. This is particularly important for analyzing privacy-oriented cryptocurrencies like Zcash.

For analytics interoperability, we designed a unified schema that allows all Bitcoin-like datasets to share queries. To further interoperate with Ethereum and ERC-20 token transactions, we also created views that abstract the blockchain ledger so it can be presented as a double-entry accounting ledger.

Double-entry book view: example queries

To motivate an initial exploration of these new datasets, let's start with a simple example, comparing the way to query both payments and receipts across multiple cryptocurrencies.
This comparison is the simplest way to verify that a cryptocurrency is operating as intended and, at least operationally, is a mathematically correct store of value.

1. Balance queries demonstrating preservation of value

Here are some equivalent balance queries for the Bitcoin and Dogecoin datasets. Note that the only difference between them is the name of the data location. You can swap in Bitcoin Cash, Dash, Litecoin, and Zcash in a similar fashion.
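As a minimal sketch of what such a balance query can look like (assuming the unified schema's inputs and outputs tables, each with an addresses array and a value column in base units), the following tallies one address's balance on the Bitcoin chain; replacing crypto_bitcoin with crypto_dogecoin yields the Dogecoin equivalent:

    -- Hedged sketch: balance of a single address on the Bitcoin chain.
    -- Assumes the unified schema's inputs and outputs tables, each with an
    -- addresses array and a value column in base units (satoshis).
    WITH book AS (
      SELECT address, value
      FROM `bigquery-public-data.crypto_bitcoin.outputs`, UNNEST(addresses) AS address
      UNION ALL
      SELECT address, -value AS value
      FROM `bigquery-public-data.crypto_bitcoin.inputs`, UNNEST(addresses) AS address
    )
    SELECT address, SUM(value) AS balance
    FROM book
    WHERE address = '1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa'  -- illustrative address
    GROUP BY address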
2. Understanding miner economics on Bitcoin

The BigQuery dataset makes it possible to analyze how miners are allocating space in the blocks they mine. This query shows that transaction fees on the Bitcoin network follow a Poisson distribution, and confirms that zero-fee transactions are being included in mined blocks. Given that miners are incentivized to profit from transaction fees, this raises the question: why are they including zero-fee transactions? Possible reasons include:

- Miners are including their own transactions for zero fees.
- Miners run transaction accelerators, i.e., off-chain services that allow transactors to pay mining fees out-of-band (typically with fiat currency) for the purpose of accelerating confirmation of transactions.
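As a hedged sketch of a starting point for this kind of analysis (assuming the transactions table carries fee and is_coinbase columns), this query separates zero-fee from fee-paying transactions and summarizes the fee distribution:

    -- Hedged sketch: summarize mined transaction fees, separating zero-fee
    -- transactions. Assumes fee and is_coinbase columns on transactions.
    SELECT
      fee = 0 AS is_zero_fee,
      COUNT(*) AS tx_count,
      APPROX_QUANTILES(fee, 4) AS fee_quartiles
    FROM `bigquery-public-data.crypto_bitcoin.transactions`
    WHERE NOT is_coinbase  -- coinbase (block reward) transactions carry no fee
      AND block_timestamp >= TIMESTAMP '2018-01-01'
    GROUP BY is_zero_fee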
3. Understanding how often Bitcoin addresses are reused

Over 91% of addresses on the Bitcoin network have been used only once. Creating a new Bitcoin address for each inbound payment is a suggested best practice for users seeking to protect their privacy, because blockchain analytics makes it possible to identify which other addresses a given user's wallet has transacted with, and the size of the shared transactions. This query can be plotted to show the relationship between addresses and the number of transacting partners.
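A sketch of such a reuse analysis (again assuming an outputs table with an addresses array) counts how many times each address has received funds and then builds a histogram over those counts:

    -- Hedged sketch: histogram of address reuse on the Bitcoin chain.
    WITH uses AS (
      SELECT address, COUNT(*) AS times_used
      FROM `bigquery-public-data.crypto_bitcoin.outputs`, UNNEST(addresses) AS address
      GROUP BY address
    )
    SELECT times_used, COUNT(*) AS num_addresses
    FROM uses
    GROUP BY times_used
    ORDER BY times_used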
Multi-chain crypto-econometrics

Beyond quality control and auditing applications, presenting cryptocurrency in a traditional format enables integration with other financial data management systems. As an example, let's consider a common economic measure, the Gini coefficient. In the field of macroeconomics, the Gini coefficient is a member of a family of econometric measures of wealth inequality. Values range between 0.0 and 1.0, with completely distributed wealth (all members have the same amount) mapping to a value of 0.0 and completely accumulated wealth (one member has everything) mapping to 1.0.

Typically, the Gini coefficient is estimated for a specific country's economy based on data sampling or imputation. For crypto-economies, we have complete transparency of the data at the highest possible resolution.

In addition to data transparency, one of the purported benefits of cryptocurrencies is that they allow the implementation of money to more closely resemble the implementation of digital information. It follows that a fully digitized money network will come to resemble the internet, with reduced transactional friction and fewer barriers that impede capital flow. Frequently implicit in this narrative is that capital will distribute more equally. But we don't always observe that particular outcome, and the crypto-assets presented here display a broad spectrum of distribution patterns over time. You can read more about using the Gini coefficient to reason about crypto-economic network performance in Quantifying Decentralization.

To set a baseline for interpreting our findings, consider how resources are distributed in traditional, non-crypto economies. According to a World Bank analysis in 2013, recent Gini coefficients for world economies have a mean value of 39.6 (with a standard deviation of 9.6). We plot a histogram of the reported data below. Some recent Gini measures include:

- South Africa (2010): 67
- Sweden (2008): 26
- United States (2011): 48
- Venezuela (2011): 39

We use the double-entry book pattern to compare the equality of cryptocurrency distribution for the Bitcoin-like datasets being released today, along with Ethereum and a few Ethereum-based ERC-20 tokens. Primary data were normalized using a few different views (BTC-family to DE-Book, Ethereum to DE-Book, and ERC-20 to DE-Book).

In the figure below, the Gini coefficient is rendered for the top 10,000 address balances within each dataset, tabulated daily and across the entire history. The Bitcoin-like cryptocurrencies are rendered in ochre tones while the Ethereum chains and the ERC-20 Maker token are rendered in blue tones. Note that Bitcoin Cash is rendered as a dotted line, diverging from Bitcoin in mid-2017. Similarly, Ethereum Classic diverges as a dotted line away from Ethereum.

It's difficult to make conclusive statements about the crypto-economies from the Gini coefficient for the following reasons:

- Many of the crypto-assets are stored in exchanges and don't correspond to individual holders. This biases the Gini coefficient toward accumulation.
- Gini is known to be sensitive to the inclusion of small balances, and the analysis is usually done on large addresses only. Removing small balances, as we did here, biases the Gini coefficient toward distribution.
- In our analysis, all addresses are treated as individual holders. In reality, multiple addresses can belong to the same individual. This can bias the Gini either toward accumulation or distribution.

And when examining the chart to compare specific cryptocurrencies:

- Zcash in particular is difficult to measure because it has many so-called shielded transactions that produce addresses for which the balance cannot be accurately tabulated. It's not clear in which direction shielded transactions bias the Gini coefficient. However, we speculate that there is asymmetric interest in using shielded transactions: larger holders are more likely to want to keep their holdings private, and it follows that the Gini for Zcash is probably biased toward distribution.
- Dash has a system property whereby interest payments may be earned from the network by address balances that hold a minimum of 1000 DASH. Large asset holders are incentivized to split holdings among multiple addresses, which biases the Gini toward distribution. Even so, Dash is remarkably well distributed relative to all other cryptocurrencies examined here.
- Bitcoin Cash was purportedly created to increase transfer-of-value use cases through lower transaction fees, which should ultimately lead to a lower Gini coefficient of address balances. However, we see that the opposite is true—Bitcoin Cash holdings have actually accumulated since Bitcoin Cash forked from Bitcoin. Similarly, the Ethereum Classic currency was rapidly accumulated post-divergence and remains so.
- The ERC-20 token Maker (a stablecoin) has a distribution that is decoupled from its parent chain, Ethereum. Maker was issued as a distinct asset on the Ethereum chain, in contrast to Ethereum's native currency, Ether.
- In early December 2018, Bitcoin, Ethereum, and Litecoin had a major distribution event, while Bitcoin Cash had a major accumulation event. This was the largest redistribution of large Bitcoin balances since December 2011.

The Bitcoin redistribution appears to be related to an announced Coinbase reorganization of funds storage. Given the synchronization of movements, it is likely that the Ethereum redistribution was also Coinbase activity. Here's the code to query the participating addresses. Also find a visualization of the distribution event below, with addresses as circles and lines between circles as value transfers. The original holding address is at the center. Sizes are determined by the post-event distribution of value, with peripheral circle areas proportional to the final balance and edge weights proportional to the logarithm of the amount of Ether transferred.

Studies in the domains of ecology and network science tell us that biodiversity is positively correlated with ecological stability and increases ecosystem productivity by supporting more complex community structures. The downward trend of the Gini coefficient (i.e., higher levels of diversity) for crypto-asset holdings is likely a positive sign for the future health of crypto-economies.

The Gini coefficient is but one of a number of econometric indicators of wealth inequality, and other indicators may give contradictory results. Rather than drawing conclusions from the analysis presented here, we emphasize that we've built useful infrastructure for performing analysis, and fully expect that motivated analysts will swap in their own methods.

Address classification

Blockchain transaction history can be aggregated by address and used to analyze user behavior. To motivate further exploration, we present a simple classifier that can detect Bitcoin mining pools. As a brief historical note, mining pools were created when the difficulty of mining Bitcoin reached such a level that rewards could be expected only once every few years. Miners began to pool their resources to earn a smaller share of rewards more consistently and in proportion to their contribution to the pool in which they were mining.

First, we constructed 26 feature vectors to characterize incoming and outgoing transaction flows to each address. Next, we trained the model using labels derived from transaction signatures. Many large mining pools identify themselves in the signature of blocks' coinbase transactions. Parsing these signatures, we labelled 10,000 addresses as belonging to known mining pools. One million other addresses were included in the dataset as "non-miners." The query used to generate our features and labels can be seen here, and the source code for this analysis can be found in a Kaggle notebook here.

Model selection

We used a random forest classification model for its strong out-of-the-box effectiveness at building a good classifier and its ability to model nonlinear effects.

Because known mining pools are a very small percentage of our data, we are interested in correctly identifying as many of them as possible. In other words, we focused on maximizing recall. To ensure the minority class is adequately represented, we weighted classes in inverse proportion to how frequently they appear in the data.

Interpreting the results

The confusion matrix below summarizes the performance of the classification model on a subset of addresses reserved for model testing. False positives (in the upper right quadrant) merit closer inspection. These addresses may belong to "dark" mining pools, i.e., those which are not publicly known or do not identify themselves in coinbase transaction signatures.

Because our dataset is imbalanced, as you can see in the matrix above, it is useful to examine the relationship between precision and recall. The model threshold can be adjusted to increase recall (fewer false negatives), but at the expense of decreased precision (more false positives).

We can examine relative feature importance to determine which features are the strongest predictors in our model. Unsurprisingly, given that mining pools make many small payments to their cooperating members, the following features have the most predictive power for a mining pool address:

- Number of output transactions
- Total number of transaction outputs
- Total number of transaction inputs

For a deeper understanding of query performance on the blockchain, check out a comparison of transaction throughputs for blockchains in BigQuery.

Next steps

To get started exploring the new datasets, here are links to them in BigQuery:

- Bitcoin (new location): bigquery-public-data.crypto_bitcoin
- Bitcoin Cash: bigquery-public-data.crypto_bitcoin_cash
- Dash: bigquery-public-data.crypto_dash
- Dogecoin: bigquery-public-data.crypto_dogecoin
- Ethereum (new location): bigquery-public-data.crypto_ethereum
- Ethereum Classic: bigquery-public-data.crypto_ethereum_classic
- Litecoin: bigquery-public-data.crypto_litecoin
- Zcash: bigquery-public-data.crypto_zcash

There's also a Kaggle notebook that illustrates how to import data into a notebook for applying machine learning algorithms to the data. We hope these new public datasets encourage you to try out BigQuery and BigQuery ML for yourself. Or, if you run your own enterprise-focused blockchain, these datasets and sample queries can guide you as you form your own blockchain analytics.

Until then, if you have questions about this blog post, feel free to reach out to the authors on Twitter: Allen Day, Evgeny Medvedev, Nirmal AK, and Will Price. And here's a shout-out to the outside contributors who helped develop and review this blog post: Gitcoin, for supporting Blockchain ETL; Samuel Omidiora and Yaz Khoury, for contributing to Blockchain ETL; and Aleksey Studnev of Bloxy for valuable discussions of analyses.
Source: Google Cloud Platform

Exoplanets, astrobiological research, and Google Cloud: What we learned from NASA FDL’s Reddit AMA

Are we alone in the universe? Does intelligent life exist on other planets? If you've ever wondered about these things, you're not the only one. Last summer, we partnered with NASA's Frontier Development Lab (FDL) to help find answers to these questions—you can read about some of this work in this blog post. And as part of this work we partnered with FDL researchers to host an AMA ("ask me anything") to answer all those burning questions from Redditlings far and wide. Here are some of the highlights:

Question: What can AI do to detect intelligent life on other planets?

Massimo Mascaro, Google Cloud Director of Applied AI: AI can help extract the maximum information from the very faint and noisy signals we can get from our best instruments. AI is really good at detecting anomalies and at digging through large amounts of data, and that's pretty much what we do when we search for life in space.

Question: About how much data is expected to be generated during this mission? Are we looking at terabytes, 10s of terabytes, or 100s of terabytes of data?

Megan Ansdell, planetary scientist with a specialty in exoplanets: The TESS mission will download ~6 TB of data every month as it observes a new sector of sky containing 16,000 target stars at 2-minute cadence. The mission lifetime is at least 2 years, which means TESS will produce on the order of 150 TB of data. You can learn more about the open source deep learning models that have been developed to sort through the data here.

Question: What does it mean to simulate atmospheres?

Giada Arney, astronomy and astrobiology (mentor): Simulating atmospheres for me involves running computer models where I provide inputs to the computer on gases in the atmosphere, "boundary conditions," temperature, and more. These atmospheres can then be used to simulate telescopic observations of similar exoplanets so that we can predict what atmospheric features might be observable with future observatories for different types of atmospheres.

Question: How useful is a simulated exoplanet database?

Massimo Mascaro: It's important to have a way to simulate the variability of the data you could observe, before observing it, to understand your ability to distinguish patterns, to plan how to build and operate instruments, and even to plan how to analyze the data eventually.

Giada Arney: Having a database of different types of simulated worlds will allow us to predict what types of properties we'll be able to observe on a diverse suite of planets. Knowing these properties will then help us to think about the technological requirements of future exoplanet-observing telescopes, allowing us to anticipate the unexpected!

Question: Which off-the-shelf Google Cloud AI/ML APIs are you using?

Massimo Mascaro: We've leveraged a lot of Google Cloud's infrastructure, in particular Compute Engine and GKE, to both experiment with data and run computation at large scale (using up to 2,500 machines simultaneously), as well as TensorFlow and PyTorch running on Google Cloud to train deep learning models for the exoplanets and astrobiology experiments.

Question: What advancements in science can become useful in the future other than AI?

Massimo Mascaro: AI is just one of the techniques science can benefit from in our times. I would definitely put wide access to computation in that league.
This is not only helping science in data analysis and AI, but in simulation, instrument design, communication, etc.

Question: What do you think are the key things that will inspire the next generation of astrophysicists, astrobiologists, and data scientists?

Sara Jennings, Deputy Director, NASA FDL: For future data scientists, I think it will be the cool problems like the ones we tackle at NASA FDL, which they will be able to solve using new and ever-increasing data and techniques. With new instruments and data analysis techniques getting so much better, we're now at a moment where asking questions such as whether there's life outside our planet is no longer preposterous, but real scientific work.

Daniel Angerhausen, astrophysicist with expertise spanning astrobiology to exoplanets (mentor): I think one really important point is that we see more and more women in science. This will be such a great inspiration for girls to pursue careers in STEM. For most of the history of science we were just using 50 percent of our potential, and this will hopefully be changed by our generation.

You can read the full AMA transcript here.
Source: Google Cloud Platform

The service mesh era: Advanced application deployments and traffic management with Istio on GKE

Welcome back to our series about the Istio service mesh. In our last post, we explored the benefits of using a service mesh, and placed Istio in context with other developments in the cloud-native ecosystem. Today, we'll dive into the "what" and "how" of installing and using Istio with a real application. Our goal is to demonstrate how Istio can help your organization decrease complexity, increase automation, and ease the burden of application management on your operations and development teams.

Install with ease; update automatically

When done right, a service mesh should feel like magic: a platform layer that "just works," freeing up your organization to use its features to secure, connect, and observe traffic between your services. So if Istio is a platform layer, why doesn't it come preinstalled with Kubernetes? If Istio is middleware, why are we asking developers to install it?

At Google, we are working on simplifying adoption by providing a one-click method of installing Istio on Kubernetes. Istio on GKE (https://cloud.google.com/istio/docs/istio-on-gke/overview), the first managed offering of its kind, is an add-on for Google Kubernetes Engine (GKE) that installs and upgrades Istio's components for you—no YAML required. With Istio on GKE, you can create a cluster with Istio pre-installed, or add Istio to an existing cluster.

Installing Istio on GKE is easy, and can be done either through the Cloud Console or the command line. The add-on supports mutual TLS, meaning that with a single check-box, you can enforce end-to-end encryption for your service mesh. Once enabled, Istio on GKE provisions the Istio control plane for you, and enables Stackdriver integrations. You get to choose into which namespaces, if any, the Istio sidecar proxy is injected.
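For instance, here is a minimal command-line sketch; the cluster name and zone are illustrative, and the exact flags may differ by gcloud release (the add-on was in the beta track at the time of writing):

    # Create a GKE cluster with the Istio add-on and strict mutual TLS.
    gcloud beta container clusters create istio-demo \
        --zone=us-central1-a \
        --addons=Istio \
        --istio-config=auth=MTLS_STRICT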
Now that we have Istio installed on a GKE cluster, let's explore how to use it with a real application. For this example, we'll use the Hipster Shop demo, a microservices-based web application. While this sample app has multiple components, in this post we'll focus on Product Catalog, which serves the list of products above. You can follow along in this post with the step-by-step tutorial here.

Zero effort Stackdriver: Monitoring, logging, and tracing

When you use Istio on GKE, the Stackdriver Monitoring API is provisioned automatically, along with an Istio adapter that forwards service mesh metrics to Stackdriver. This means that you have access to Istio metrics right away, alongside hundreds of existing GCP and GKE metrics.

Stackdriver includes a feature called the Metrics Explorer, which allows you to use filters and aggregations together with Stackdriver's built-in metrics to gain new insights into the behavior of your services. The example below shows an Istio metric (requests per second) grouped across each microservice in our sample application. You can add any Metrics Explorer chart to a new or existing Stackdriver Dashboard. Using Dashboards, you can also combine Istio metrics with your application metrics, giving you a more complete view into the status of your application.

You can also use Stackdriver Monitoring to set SLOs using Istio metrics—for example, latency, or non-200 response codes. Then, you can set Stackdriver Policies against those SLOs to alert you when a policy reaches a failing threshold. In this way, Istio on GKE sets up your organization with SRE best practices, out of the box.

Istio on GKE also makes tracing easy. With tracing, you can better understand how quickly your application is handling incoming requests, and identify performance bottlenecks. When Stackdriver Trace is enabled and you've instrumented tracing in your application, Istio automatically collects end-to-end latency data and displays it in real time in the GCP Console.

On the logging front, Stackdriver also creates a number of logs-based metrics. With logs-based metrics, you can extract latency information from log entries, or record the number of log entries that contain a particular message. You can also develop custom metrics to keep track of logs that are particularly important to your organization. Then, using the Logs Viewer, you can export the logs to Google Cloud data solutions, including Cloud Storage and BigQuery, for storage and further analysis.

Traffic management and visualization

In addition to providing visibility into your service mesh, Istio supports fine-grained, rule-based traffic management. These features give you control over how traffic and API calls flow between your services. As the first post in this series explains, adopting a service mesh lets you decouple your applications from the network. And unlike Kubernetes services, where load balancing is tethered to the number of running pods, Istio allows you to decouple traffic flow from infrastructure scaling through granular percentage-based routing.

Let's run through a traffic routing example, using a canary deployment. A canary deployment routes a small percentage of traffic to a new version of a microservice, then allows you to gradually roll it out to the whole user base, while phasing out and retiring the old version. If something goes wrong during this process, traffic can be switched back to the old version.

In this example, we create a new version of the ProductCatalog microservice. The new version ("v2") is deployed to Kubernetes alongside the working ("v1") deployment. Then, we create an Istio VirtualService (traffic rule) that sends 25% of ProductCatalog traffic to v2. We can deploy this rule to the Kubernetes cluster, alongside our application. With this policy, no matter how much production traffic goes to ProductCatalog—and how many pods scale up as a result—Istio ensures that the right percentage of traffic goes to the specified version of ProductCatalog.
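Here is a sketch of what that rule can look like. The service, subset, and label names are illustrative (modeled on the demo's productcatalogservice), and the DestinationRule defines the v1 and v2 subsets that the VirtualService's weighted routes refer to:

    # Hedged sketch: 75/25 traffic split for the product catalog service.
    apiVersion: networking.istio.io/v1alpha3
    kind: DestinationRule
    metadata:
      name: productcatalogservice
    spec:
      host: productcatalogservice
      subsets:
      - name: v1
        labels:
          version: v1
      - name: v2
        labels:
          version: v2
    ---
    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: productcatalogservice
    spec:
      hosts:
      - productcatalogservice
      http:
      - route:
        - destination:
            host: productcatalogservice
            subset: v1
          weight: 75
        - destination:
            host: productcatalogservice
            subset: v2
          weight: 25

Applying this with kubectl shifts 25% of requests to v2, regardless of how many pods back each version.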
We'll also use another feature of Istio and Envoy: for demo purposes, we inject a three-second latency into all ProductCatalog v2 requests. Once the canary version is deployed to GKE, we can open Metrics Explorer to see how ProductCatalog v2 is performing. Notice that we are looking at the Istio server response latency metric, and we have grouped by "destination workload name"—this tells us the time it takes for each service to respond to requests. Here, we can see ProductCatalog v2's injected three-second latency.

From here, it's easy to roll back from v2 to v1. We can do this by updating the Istio VirtualService to return 100% of traffic to v1, then deleting the v2 Kubernetes deployment. Although this example demonstrates a manual canary deployment, often you'll want to automate the process of promoting a canary deployment: increasing traffic percentages, and scaling down the old version. Open-source tools like Flagger can help automate percentage-based traffic shifting for Istio.

Istio supports many other traffic management rules beyond traffic splitting, including content-based routing, timeouts and retries, circuit breaking, and traffic mirroring for testing in production. As in this canary example, these rules can be defined with the same declarative Istio building blocks. We hope this example gives you a taste of how, together, Istio and Stackdriver help simplify complex traffic management operations.

What's next?

To get some more hands-on experience with Istio on GKE, check out the companion demo. You can find the instructions for getting started on GitHub. To read more about Istio, Stackdriver, and traffic management, see:

- Drilling down into Stackdriver Service Monitoring (GCP blog)
- Incremental Istio Part 1, Traffic Management (Istio blog)

Stay tuned for the next post, which will be all about security with Istio.
Source: Google Cloud Platform

Introducing WebSockets support for App Engine Flexible Environment

Do you have an application that could benefit from being able to stream data from the app to the client with minimal latency—without the client having to poll for updates? Today, we are excited to announce that App Engine Flexible Environment now supports the WebSocket protocol in beta—the first time that App Engine supports a streaming protocol. Many users have been looking forward to this feature, as this capability is useful in a number of scenarios, including:

- Real-time event updates, such as sports scores, stock market prices, etc.
- User notifications, such as software updates or content updates
- Chat applications
- Collaborative editing tools
- Multiplayer games
- Feeds, such as social media and news

WebSockets is available to your App Engine Flexible Environment application with no special setup. Take a look at our documentation to learn more: Python | Java | Node.js.

For clients that don't support WebSockets, some libraries like socket.io fall back on HTTP long polling. To help you achieve better performance in these cases, we have also added a new "session affinity" setting to app.yaml that allows requests from a single client to be preferentially sent to the same App Engine instance (a minimal app.yaml sketch appears at the end of this post). You should only use session affinity for performance optimization, and you should continue to store application state in a persistent way outside instance memory, since App Engine instances are all periodically restarted.

Our alpha customers are already using WebSockets in production. Shine is a French provider of mobile banking services, and has implemented WebSockets across several parts of its platform. "We use WebSockets in App Engine Flex to exchange information like banking transactions, user profiles or user metadata between our front-end and back-end. It has worked perfectly for us for several months, was easy to set up and has significantly reduced latency and consumed bandwidth." – Raphaël Simon, Shine

Support for WebSockets is in beta today and we look forward to making it generally available soon. Check out App Engine and try the new WebSocket protocol today!
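Here is the minimal app.yaml sketch mentioned above; the runtime is illustrative, and the relevant setting is session_affinity under network:

    # Hedged app.yaml sketch for the flexible environment. session_affinity
    # routes a given client's requests to the same instance where possible.
    runtime: nodejs
    env: flex
    network:
      session_affinity: true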
Source: Google Cloud Platform

Last month today: GCP in January

We kicked off the new year on the Google Cloud Platform (GCP) blog with lots to say about various services across cloud, including developing applications to run on cloud, planning and executing a cloud migration, and what to know about modern cloud security. Here are the most-read posts from January.

Prepare your infrastructure: Security trends and tips for 2019

Along with the New Year's resolutions and planning that January brings, there's a good reminder here that the work of securing your user data and applications never ends. Google security experts have strong views about the security trends you need to know about in 2019, from the continued importance of two-factor authentication to concepts such as zero-trust. Take a look at all the security trends here.

The topic of identity and access management in the cloud covers quite a lot of ground. But from a practical perspective, there are some typical use cases for authentication and identity management that you can use to decide which methods and Google Cloud products will meet your particular needs, whether it's external users, API calls and more. See details and charts on using different authentication methods.

Testing, planning and other cloud prep

Migrating your workloads and data into the cloud is a great first step to taking advantage of all of cloud's benefits. To help get you there, we published a cloud migration checklist from Velostrata that covers the considerations you should take into account when you're deciding which VMs to move to cloud and when. These considerations might include the production or criticality level of the application, its compliance and regulatory status, and app integrations and dependencies. Check the post out for guidance and tips on migrating to cloud.

Traffic navigation app Waze uses the Kayenta feature of the open-source Spinnaker deployment tool to do automated canary analysis, giving them advanced insight into how new app versions might perform in production. But why build your own canary deployment system when you can develop a pipeline that works for you with Spinnaker? Learn more about developing canary configurations, reports and metrics.

Creating and honing the building blocks of cloud

Enterprise software development becomes more difficult and expensive if businesses continually customize it to meet every user's needs. That drives up costs. But the enterprise software delivery model could change quickly as concepts like serverless and tools like GCP's Cloud Spanner database open the door to building "autonomous apps" that comprise multiple microservices. Get the whole story on the future of developing and delivering software.

Developers want to write code in the language that is most familiar to them and their company, and for a lot of them, that means Go. As of January, Go 1.11 is a supported language on Cloud Functions, our serverless event-based compute platform. You can read about the key language features that are available in this version. Read the details of Go 1.11 on Cloud Functions.

That's a wrap for January. Stay tuned to the GCP blog for the latest cloud news and tips.
Source: Google Cloud Platform

10 tips for building long-running clusters using Cloud Dataproc

Google's Cloud Dataproc is a fast, easy-to-use, fully managed cloud service for running Apache Spark and Apache Hadoop clusters in a simple, cost-efficient way. Google Cloud Platform (GCP) customers like Pandora and Outbrain depend on Cloud Dataproc to run their Hadoop and Spark jobs.

A key differentiator for Cloud Dataproc is that it is optimized to create ephemeral job-scoped clusters in around 90 seconds. This speed of deployment means that a single job can have a dedicated cluster, containing just the resources needed to run the job, that is shut down upon job completion. On the Cloud Dataproc team, we've worked with countless customers who are creating clusters for their particular use cases. However, not all Hadoop and Spark workloads are appropriately served by an ephemeral job-scoped cluster model. Our goal on the Cloud Dataproc team is to make sure every customer's use case can be addressed. To that end, we're excited to share these tips and recommendations for using Cloud Dataproc in a non-ephemeral model.

How Cloud Dataproc clusters work

If you're just getting started, here's a quick primer on how Cloud Dataproc works. When you use Cloud Dataproc to create clusters, you can have a seemingly limitless amount of computation running in parallel, since you have access to GCP's global fleet of virtual machines. As a result, you don't need to manage YARN queues and isolate runaway jobs like you would with Hadoop or Spark clusters running on-premises. In the diagram below, you can see a representation of how, for each job, Cloud Dataproc can deploy a cluster sized to match that job's requirements.

Specific features that make running job-scoped clusters in Cloud Dataproc fast, easy and cost-effective include the Jobs API, Workflow Templates API, Cluster Scheduled Deletion, and Service Accounts. While the job-scoped cluster model has been effective for our Cloud Dataproc customers with batch processing and ETL/ELT jobs, we've heard that there are a variety of other use cases where balancing semi-long-running clusters alongside cloud capabilities is critical. Some scenarios for semi-long-running Cloud Dataproc clusters include:

- Interactive or ad hoc analysis, often through the Cloud Dataproc web-based notebook optional components Jupyter and Zeppelin
- Reporting/dashboarding applications that expose cross-database queries in Presto
- Data and SQL exploration tools added to Cloud Dataproc with init actions (for example, HUE)
- Streaming jobs such as those found in Apache Spark DStreams architectures or Beam on Flink deployments
- Continually running jobs in workflow engines like Oozie

From these use cases and the conversations we have with customers, we see a pattern emerging for how teams want to shift long-running jobs to the cloud. There is a real desire to share the cluster among many users, which means 24/7 availability, while at the same time not being locked into the same confines that exist with on-premises Hadoop/Spark clusters. At its most basic, the long-running cloud cluster model that we hear customers want looks like this:

In this model, the goal is to deploy a small cluster, submit the jobs as the end user, and then have the cluster dynamically scale up and down to meet demand within a predetermined budget. To make this easier, Cloud Dataproc recently made a smart autoscaler available in alpha, which examines pending and available YARN memory, averaged over a configurable period of time, to determine an exact number of workers to add or remove from the cluster.
In addition, to provide enhanced user-level security within the cluster itself, the Kerberos Optional Component for Cloud Dataproc includes and configures the MIT distribution of Kerberos to provide user authentication, isolation, and encryption. In this model, High Availability (HA) mode can also be useful in the rare event of a Google Compute Engine failure of the master node. In HA mode, HDFS High Availability and YARN High Availability are configured to allow uninterrupted YARN and HDFS operations despite any single-node failures/reboots.

Below, we have compiled a list of the top 10 tips based on what we've learned from customers who have successfully built and continue to use semi-long-running Cloud Dataproc clusters. A common element to the success of these semi-long-running clusters is the practice of storing stateful data in GCP and then using Cloud Dataproc clusters for processing. The underlying theme behind all these tips is not building a single cluster that lives forever. Rather, use the automation and services of GCP to move the persistent data off the cluster. This is essential to letting you manage and scale compute resources independently of the data, apply the right tool to the job, and capture the value of the cloud, even for long-running clusters. While the focus of these tips is Cloud Dataproc, many of the same techniques and concepts can be applied to running the Hortonworks Data Platform on Google Cloud.

10 tips for building long-running Cloud Dataproc clusters

1. Use Google Cloud Storage as your primary data source and sink
2. Persist information on how to build your clusters
3. Identify a source control mechanism
4. Externalize the Hive metastore database with Cloud SQL
5. Use cloud authentication and authorization policies
6. Know your way around Stackdriver
7. Transform YARN queues into workflow templates
8. Start small and enable autoscaling
9. Consolidate job history across multiple clusters
10. Take advantage of GCP services

Tip 1. Use Google Cloud Storage as your primary data source and sink

This tip is first because it is imperative to achieving "cloud freedom" for a cluster (i.e., severing dependencies between storage and compute). Luckily, this is often the easiest change to make, since converting from HDFS to Cloud Storage is usually as simple as a file prefix change (more on HDFS vs. Cloud Storage here); a short sketch follows below.

The bottom line is that you can't scale with HDFS because storage is still tied to compute. Not only does HDFS couple a very pricey resource—compute—with a relatively inexpensive resource—storage—but it also siloes the data into a single cluster instead of exposing it to all the possibilities offered by GCP. There are still plenty of reasons to use a Cloud Dataproc storage device such as local SSDs, but the purposes should primarily be limited to ephemeral data such as scratch space, shuffle data, and LLAP cache.
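The sketch mentioned above: often only the URI scheme in a job's arguments needs to change (the bucket, path, and job names here are illustrative):

    # Before: the job reads from cluster-local HDFS
    spark-submit my_job.py hdfs:///datasets/events/2019/01/
    # After: the same job reads directly from Cloud Storage
    spark-submit my_job.py gs://my-bucket/datasets/events/2019/01/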
Tip 2. Persist information on how to build your clusters

When creating a Cloud Dataproc cluster, there are several ways to represent the cluster itself as code. This lets you maintain a representation of the cluster even when it is not running.

The first method of storing a cluster as code is by specifying initialization actions, which are executables or scripts that Cloud Dataproc will run on all nodes in the cluster as soon as it's set up. Initialization actions often set up job dependencies, such as installing Python packages, so that when you do need to tear down a long-running cluster or re-create a cluster to update it to the latest version of Cloud Dataproc, you can recreate the environment automatically with the initialization action. Check out the Cloud Dataproc GitHub repository of initialization actions for examples.

Alternatively, you can get started by creating a Cloud Dataproc custom image, which captures everything that's installed on the disk. You can use that image across a variety of cluster shapes and sizes without having to write scripts to do the installation and configuration.

In addition, to simply capture the information in the cluster configuration files, you can export a running cluster configuration to a YAML file. This same YAML can then be used as an input to the import command, which can create a new cluster with the same configuration.
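As a hedged sketch of that round trip (cluster names and region are illustrative, and at the time of writing these commands were in the gcloud beta track):

    # Capture a running cluster's configuration as YAML
    gcloud beta dataproc clusters export my-cluster \
        --region=us-central1 --destination=my-cluster.yaml
    # Recreate an equivalent cluster from that file
    gcloud beta dataproc clusters import my-new-cluster \
        --region=us-central1 --source=my-cluster.yaml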
Tip 3. Identify a source control mechanism

Once data is securely stored in Cloud Storage and accessible to the cloud resources that you have identified, the next question we usually hear is "Where do I store my code?" A cloud migration is a great time to identify a source control mechanism that works for all the analytics users as well as the developers.

While most Java developers are experienced in source control, often, in analytics environments, the server's local file system ends up morphing into a code repository for things like SQL queries, Python scripts, and notebook files. Often the solution to this problem is not as simple as "just use Git," because many of the SQL interfaces are not as well integrated with source control as integrated development environments (IDEs) are. Even popular notebooks like Jupyter require additional tools like nbviewer to properly render the notebook's interactive features.

While a user's folder on the cluster may have provided a way to get by in the past, users will get frustrated quickly, since the work left in these folders disappears as clusters become more ephemeral in the cloud. Cloud Dataproc has taken some steps to mitigate this. For example, when using the Jupyter optional component, Cloud Dataproc will automatically configure notebooks to be backed up with Cloud Storage, making the same notebook files accessible to other clusters. However, these mitigations should not be a substitute for a well-thought-out source control framework. Whatever framework you choose, be sure to include the scripts and code that you use to build your clusters as well. For more on source control that is directly integrated with GCP, check out Cloud Source Repositories.

Tip 4. Externalize the Hive metastore database with Cloud SQL

The Hive metastore holds metadata about Hive tables, such as their schema and location, which in the cloud is usually an external table location in Cloud Storage. MySQL is commonly used as a back end for the Hive metastore. When you're using GCP, Cloud SQL (our fully managed database service supporting MySQL or PostgreSQL) makes it easy to set up, maintain, manage, and administer that MySQL-based Hive back end. With Cloud Dataproc, you can use a Google-written and -maintained initialization action to set up and configure a Cloud Dataproc cluster to use an external Hive metastore database in Cloud SQL; a sketch follows below.

Using Cloud SQL as the Hive metastore database makes it easy to discover metadata and makes it possible to attach multiple clusters to the same source of Hive metadata. Additionally, Ranger and Atlas policy data can be stored in this database, persisted, and used across many clusters. For a full tutorial on externalizing the Hive metastore database, see Using Apache Hive on Cloud Dataproc.
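The sketch mentioned above, loosely following that tutorial; the project, region, and Cloud SQL instance names are illustrative:

    # Create a cluster whose Hive metastore database lives in Cloud SQL,
    # via the Google-maintained cloud-sql-proxy initialization action.
    gcloud dataproc clusters create hive-cluster \
        --region=us-central1 \
        --scopes=sql-admin \
        --initialization-actions=gs://dataproc-initialization-actions/cloud-sql-proxy/cloud-sql-proxy.sh \
        --metadata=hive-metastore-instance=my-project:us-central1:hive-metastore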
Tip 5. Use cloud authentication and authorization policies

Cloud authentication and authorization will vary greatly by company, and a full discussion is beyond the scope of this post. However, the principal point of this tip is that you should use the controls available in GCP's Identity and Access Management (IAM) as much as possible.

A common misstep we see some customers make is trying to bring only the security controls maintained in their clusters when moving Hadoop workloads to the cloud. This can be problematic when clusters become ephemeral. To properly secure clusters, you need to control the permissions of other cloud services (such as Cloud Storage) and control who has the ability to build the clusters themselves.

An important first step is understanding Cloud Dataproc permissions and IAM roles. Cloud Dataproc separates permissions into two categories: clusters and jobs. Cluster permissions are for administrators building the clusters, and job permissions are for the developers who submit code to the cluster. Granular IAM can also be used to limit which actions users can perform on which clusters.

Using the OS Login feature of Compute Engine can also simplify how you manage SSH access to Cloud Dataproc clusters. With OS Login enabled, IAM permissions are automatically mapped to a Linux identity and there's no longer a need to create SSH keys.

In addition to Cloud IAM, Cloud Dataproc has a Kerberos Optional Component, which customers often use to extend Cloud IAM into the cluster itself or to use existing Active Directory-based sources of identity, as shown here:

1. Each GCP user is associated with a cloud identity. This authentication mechanism gives users the ability to SSH into a cluster, run jobs via the API, and create cloud resources (i.e., a Cloud Dataproc cluster).
2. When a cloud user wants to use a Kerberized Hadoop application, a Kerberos principal must be obtained. Microsoft Active Directory is used as a cross-realm trust to map users and groups into Cloud Dataproc Kerberos principals. Note: this setup requires Active Directory to be the source of truth for user identities; Cloud Identity is only a synchronized copy.
3. When the Hadoop application needs to obtain data from Cloud Storage, the Cloud Storage connector is invoked. The Cloud Storage connector allows Hadoop to access Cloud Storage data at the block level as if it were a native part of Hadoop. This connector relies on a service account to authenticate against Cloud Storage.

Tip 6. Know your way around Stackdriver

Stackdriver is the default way to persist an audit trail with ephemeral Cloud Dataproc resources. Stackdriver Logging is used to capture the daemon logs and YARN container logs from Cloud Dataproc clusters. This is in addition to Stackdriver Monitoring, which collects and ingests metrics, events, and metadata from Cloud Dataproc clusters to generate insights via dashboards and charts. You can use Stackdriver to understand the performance and health across all Cloud Dataproc clusters at once and examine HDFS, YARN, and Cloud Dataproc job and operation metrics. If an issue is identified, Stackdriver makes it easy to drill into a specific cluster's metrics.

An easy way to get started is by going into Stackdriver Monitoring, opening Metrics Explorer, selecting the relevant resource for Cloud Dataproc, and reviewing what is available. In the rare situation where you would like to send additional metrics, you can enable custom metrics through the Cloud Dataproc cluster property "dataproc:dataproc.monitoring.stackdriver.enable" during cluster creation.

We've heard from customers that the Stackdriver integration is immensely helpful, and often that customers wish they had invested more up-front in building out their Stackdriver environment at the beginning of their cloud migration.

Tip 7. Transform YARN queues into workflow templates

Even with long-running clusters, the idea of managing YARN queues should begin to fade as you migrate to cloud-native architecture designs. For best results, transition YARN queues into separate clusters with unique cluster shapes and potentially different permissions. As long as you have persisted the storage of long-lived elements off the cluster, you can use the same data, metadata, and permissioning systems across the various clusters.

Workflow templates can make these right-sized clusters easier to configure, since they run in a single workload. A workflow template is a reusable workflow configuration. It defines a graph of jobs with information on where to run those jobs, and ultimately allows you to iterate and right-size clusters by making lightweight tweaks. Learn more about getting started with Cloud Dataproc workflow templates.
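As a brief sketch of the workflow template flow (template, cluster, and job details are illustrative, and exact flags may vary by gcloud release):

    # Define a reusable template with its own managed (ephemeral) cluster.
    gcloud dataproc workflow-templates create my-workflow --region=us-central1
    gcloud dataproc workflow-templates set-managed-cluster my-workflow \
        --region=us-central1 --cluster-name=my-ephemeral-cluster --num-workers=2
    # Add a job step, then run the whole graph with one command.
    gcloud dataproc workflow-templates add-job spark \
        --workflow-template=my-workflow --region=us-central1 --step-id=compute-pi \
        --class=org.apache.spark.examples.SparkPi \
        --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar
    gcloud dataproc workflow-templates instantiate my-workflow --region=us-central1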
Tip 8. Start small and enable autoscaling

Don't waste time trying to generate the perfect cluster configurations. Simply start with a small cluster that can autoscale to the needed (or allowable) size. The Cloud Dataproc autoscaler has a variety of settings to fine-tune how you would like the autoscaler to behave. An autoscaling cluster is a great way to have a small but available long-running cluster that can quickly become a large-scale data cruncher as soon as the workload requests it. For more on the customizations exposed by the Cloud Dataproc autoscaler, check out Autoscaling Clusters.

Tip 9. Consolidate job history across multiple clusters

We hear that many customers want to persist job history information across multiple clusters. In Cloud Dataproc, you can achieve this by pointing the MapReduce and Spark job history servers at Cloud Storage. Override the MapReduce done directory and intermediate done directory and the Spark event log directory with Cloud Storage directories, like this:

    mapred:mapreduce.jobhistory.done-dir=gs://my-bucket/done-dir
    mapred:mapreduce.jobhistory.intermediate-done-dir=gs://my-bucket/intermediate-done-dir
    spark:spark.eventLog.dir=gs://my-bucket/spark-events
    spark:spark.history.fs.logDirectory=gs://my-bucket/spark-events

When specifying Cloud Storage directories, there are a few things to watch out for. The first is that you should ensure that all directories are a subdirectory within a bucket (gs://bucket/dir) as opposed to a top-level bucket (gs://bucket). The second is that the two Spark event log directories must match exactly. Also note that you must manually create an empty Spark event log directory before running any Spark jobs. Unlike the MapReduce job history directories, this directory is not created automatically. Finally, when creating a cluster, make sure that all properties include the appropriate prefix: mapred: for MapReduce properties, spark: for Spark properties.

This technique, while useful, does come with a couple of potential pitfalls. The first is that because the MapReduce job history server periodically moves files from the intermediate done directory to the done directory, a job may finish before its history files have been moved. Make sure that the history files for a job have actually been moved before terminating a cluster if you need to have the complete job history. The second caveat is that the MapReduce job history server only reads history from Cloud Storage when it first starts up. From that point forward, the only new job history you will see in the UI is for the jobs that were run on that cluster; in other words, these are jobs whose history was moved by that cluster's job history server. By contrast, the Spark job history server is completely stateless, so the previous caveats do not apply.

To avoid these pitfalls, try an architecture in which the job history servers are run on a single-node cluster. Create a single-node cluster that has the above four properties configured to point to Cloud Storage. Then, when creating additional clusters, point them at the job history servers on your single-node cluster by setting the above four properties along with the following additional two properties (the history-node address is a placeholder for your single-node cluster's hostname):

    mapred:mapreduce.jobhistory.address=<history-node>:10020
    spark:spark.yarn.historyServer.address=<history-node>:18080

Note that if you take this approach, you should consider running an initialization action on your additional clusters to disable their job history servers, which should no longer be used. This might look something like:

    systemctl stop hadoop-mapreduce-historyserver
    systemctl stop spark-history-server

If you run MapReduce and Spark jobs on any of your clusters, you should be able to see the job history for all of your clusters in one place by accessing the web UIs of your single-node cluster (port 19888 for the MapReduce job history, port 18080 for the Spark job history). For more information on accessing web UIs, check out Cluster web interfaces.

Tip 10. Take advantage of GCP services

"Hadoop" has become a catch-all term for frameworks that run open source big data software in a somewhat standardized way. The phrase Hadoop has become synonymous with software such as Spark, Presto, and Kafka (to name a few), despite having only loose ties to the Hadoop Distributed File System (HDFS) and MapReduce frameworks that were the original Hadoop applications. GCP is an alternative big data stack, so knowing how to properly map GCP data and analytics services to long-lived Hadoop workloads and applications is essential to extracting value from the cloud.

For example, consider migrating HBase to Cloud Bigtable if you don't require co-processors or the SQL capabilities of Apache Phoenix. If you have data analysts well-versed in SQL, consider BigQuery as an alternative to Hive. If you are using Spark with scikit-learn, consider Cloud Machine Learning Engine as a way to not only train your model but easily move it to production.

Get started with Cloud Dataproc today with a tutorial, or for more information, get in touch.
Source: Google Cloud Platform

Improving the developer experience with the enhanced Apigee Developer Portal

Part and parcel of modern enterprise development is building APIs that enable you to expose your services to developers both inside and outside your organization. But just building APIs isn't enough. Getting APIs and API programs to market successfully hinges on convincing your developers to actually use them. And the key driver of getting developers to adopt and consume APIs, both within a company and among the wider developer community, is the developer portal.

To help enterprises create great developer experiences, we're announcing several enhancements to the Apigee Developer Portal, a comprehensive, customizable solution that helps API providers seamlessly onboard developers and admins who use APIs managed by Google Cloud's Apigee platform. Here is what's included in this round of updates:

- A new version of SmartDocs API reference documentation
- An enhanced theme editor and redesigned default portal theme
- Improvements to managing developer accounts

SmartDocs

Apigee's SmartDocs automatically creates beautiful API reference documentation for your developers, and features a new, three-pane view. The left pane helps developers navigate between areas of the API, while the center pane gives detailed documentation for a given operation. The right pane enables you to make API requests directly from the docs, and it includes an "expand" button so you can focus on the details of the request itself.

Documentation is built upon the OpenAPI Specification and supports both versions 2.0 and 3.0.x. Every operation defined in the OpenAPI spec gets its own page, which makes it easy for users to share and discuss specific areas of the docs and for your API team to deep-link users to the exact content they need (a minimal OpenAPI sketch appears at the end of this post).

Theme editor

Along with SmartDocs, we've enhanced the default theme using Google's Material Design toolkit. The integrated tool for creating portal themes now supports SCSS and Angular Material themes, which introduce variables, rules, and other powerful features.

Account management

Lastly, this release of the Apigee Developer Portal improves how developers create and manage accounts, and gives administrators new views for managing the users of their developer portals. API providers can now view and manage all registered user accounts, and configure automatic or manual approval for new user accounts in the list of users of the API portal admin interface. This view also lets API providers see details for all registered user accounts, view custom account registration fields, and approve, block, and delete user accounts.

Getting started

To learn more about this launch and view a demo of the latest features, please join our upcoming webcast, "How to create world-class developer experiences," on Thursday, Feb. 14.

If you're already an Apigee Edge cloud customer, check out our latest documentation to get started. There you'll find a complete feature overview, guided tutorials, FAQs, and more. If you're not already an Apigee Edge customer, try it for free.
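Here is the minimal OpenAPI sketch mentioned above; the API itself is illustrative, and with SmartDocs an operation like the GET below would receive its own reference page:

    # Minimal OpenAPI 3.0 sketch (illustrative API).
    openapi: 3.0.0
    info:
      title: Pet Store API
      version: 1.0.0
    paths:
      /pets:
        get:
          summary: List all pets
          responses:
            '200':
              description: A list of pets.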
Source: Google Cloud Platform

Creating intelligent enterprise applications using GCP services on SAP Cloud Platform

SAP Cloud Platform is a Platform-as-a-Service offering that lets SAP customers extend SAP solutions running in the cloud or on-premises, integrate applications and data, and develop and run new cloud-native applications. SAP Cloud Platform's Cloud Foundry environment is a multi-cloud offering, providing you with the option to provision open-source backing services from SAP as well as services from third-party providers.

To make it easier to connect to Google Cloud Platform (GCP) services from SAP Cloud Platform's Cloud Foundry environment, we collaborated closely with SAP on an integration guide that describes how to enable GCP services in the SAP Cloud Platform Marketplace using GCP's Open Service Broker.

Connecting to GCP services in the SAP Cloud Platform Marketplace follows the standard Cloud Foundry pattern for service brokers: you create a service, and then bind the service to one or many applications using the Cloud Foundry service broker APIs and CLI. The picture below details the architecture and steps required to access GCP services natively in the SAP Cloud Platform Cloud Foundry environment.

Implementing a broker service allows you to focus on the services you need to get your job done, without having to know how the services are built or worrying about the infrastructure needed to run them. The Open Service Broker gives you access to a wide range of GCP services, such as storage, big data, machine learning, monitoring, and debugging, that you can natively incorporate into your applications. And we'll add more GCP services to the Open Service Broker as we move forward, opening up a whole new range of services for your applications.

We want to make Google Cloud the best place to run SAP applications. We'll continue listening to your feedback, and we'll have more updates to share in the coming months. In the meantime, you can learn more about SAP solutions on GCP by visiting our website.
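To sketch what the create-then-bind flow above might look like from the Cloud Foundry CLI: the commands below use only standard cf verbs, but the service name, plan, and application name are hypothetical; the actual offerings are whatever the GCP Open Service Broker exposes in your marketplace.

    # Browse the marketplace to see which GCP services the broker exposes
    cf marketplace

    # Provision a service instance (service and plan names are hypothetical)
    cf create-service google-storage standard my-gcs-instance

    # Bind the instance to an app; credentials then appear in VCAP_SERVICES
    cf bind-service my-sap-app my-gcs-instance
    cf restage my-sap-app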
Source: Google Cloud Platform

NoSQL for the serverless age: Announcing Cloud Firestore general availability and updates

As modern application development moves away from managing infrastructure and toward a serverless future, we're pleased to announce the general availability of Cloud Firestore, our serverless, NoSQL document database. We're also making it available in 10 new locations to complement the existing three, announcing a significant price reduction for regional instances, and enabling integration with Stackdriver for monitoring.

Cloud Firestore is a fully managed, cloud-native database that makes it simple to store, sync, and query data for web, mobile, and IoT applications. It focuses on providing a great developer experience and simplifying app development with live synchronization, offline support, and ACID transactions across hundreds of documents and collections. Cloud Firestore is integrated with both Google Cloud Platform (GCP) and Firebase, Google's mobile development platform (you can learn more about how Cloud Firestore works with Firebase here). With Cloud Firestore, you can build applications that move swiftly into production, thanks to flexible database security rules, real-time capabilities, and a completely hands-off auto-scaling infrastructure.

Cloud Firestore does more than just core database tasks. It's designed to be a complete data backend that handles security and authorization, infrastructure, edge data storage, and synchronization. Identity and Access Management (IAM) and Firebase Auth are built in to help make sure your application and its data remain secure. Tight integration with Cloud Functions, Cloud Storage, and Firebase's SDK accelerates and simplifies building end-to-end serverless applications. You can also easily export data into BigQuery for powerful analysis, post-processing, and machine learning.

Building with Cloud Firestore means your app can seamlessly transition from online to offline and back at the edge of connectivity. This leads to simpler code and fewer errors. You can serve rich user experiences and push data updates to more than a million concurrent clients, all without having to set up and maintain infrastructure. Cloud Firestore's strong consistency guarantee helps minimize application code complexity and reduces bugs. A client-side application can even talk directly to the database, because enterprise-grade security is built right in. And unlike most other NoSQL databases, Cloud Firestore supports modifying up to 500 collections and documents in a single transaction while still automatically scaling to exactly match your workload.
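As a minimal sketch of the live-synchronization model described above (the collection, document, and field names are invented for illustration, and the project config is abbreviated):

    // Firebase JavaScript SDK, as used in web clients of this era.
    const firebase = require('firebase/app');
    require('firebase/firestore');

    firebase.initializeApp({ /* your Firebase project config */ });
    const db = firebase.firestore();

    // Each connected client receives every change without polling;
    // while offline, the SDK serves cached data and syncs on reconnect.
    db.collection('matches').doc('match-42')
      .onSnapshot(snapshot => {
        console.log('Latest state:', snapshot.data());
      });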
What's new with Cloud Firestore

New regional instance pricing. This new pricing takes effect on March 3, 2019 for most regional instances, and is as low as 50% of multi-region instance prices. Data in regional instances is replicated across multiple zones within a region, optimizing for lower cost and lower write latency; we recommend multi-region instances when you want to maximize the availability and durability of your database.

SLA now available. You can now take advantage of Cloud Firestore's SLA: 99.999% availability for multi-region instances and 99.99% availability for regional instances.

New locations available. There are 10 new locations for Cloud Firestore:

Multi-region
- Europe (eur3)

North America (regional)
- Los Angeles (us-west2)
- Montréal (northamerica-northeast1)
- Northern Virginia (us-east4)

South America (regional)
- São Paulo (southamerica-east1)

Europe (regional)
- London (europe-west2)

Asia (regional)
- Mumbai (asia-south1)
- Hong Kong (asia-east2)
- Tokyo (asia-northeast1)

Australia (regional)
- Sydney (australia-southeast1)

Cloud Firestore is now available in 13 locations in total.

Stackdriver integration (in beta). You can now monitor Cloud Firestore read, write, and delete operations in near-real time with Stackdriver.

More features coming soon. We're working on adding some of the most requested features from our developer community, such as querying for documents across collections and incrementing database values without needing a transaction.

As the next generation of Cloud Datastore, Cloud Firestore is compatible with all Cloud Datastore APIs and client libraries. Existing Cloud Datastore users will be live-upgraded to Cloud Firestore automatically later in 2019. You can learn more about this upgrade here.

Adding flexibility and scalability across industries

Cloud Firestore is already changing the way companies build apps in media, IoT, mobility, digital agencies, real estate, and many other industries. The unifying themes among these workloads include the need for mobility even when connectivity lapses, scalability for many users, and the ability to move quickly from prototype to production. Here are a few of the stories we've heard from Cloud Firestore users.

When opportunity strikes…

In the highly competitive world of shared, on-demand personal mobility via cars, bikes, and scooters, the ability to deliver a differentiated user experience, iterate rapidly, and scale is critical, and the prize is huge. Skip provides a scooter-sharing system where shipping fast can have a big impact. Mike Wadhera, CTO and Co-founder, says, "Cloud Firestore has enabled our engineering and product teams to ship at the clock-speed of a startup while leveraging Google-scale infrastructure. We're delighted to see continued investment in Firebase and the broader GCP platform."

Another Cloud Firestore user, digital consultancy The Nerdery, has to deliver high-quality results in a short period of time, often needing to integrate with existing third-party data sources. They can't build up and tear down complicated, expensive infrastructure for every client app they create. "Cloud Firestore was a great fit for the web and mobile applications we built because it required a solution to keep 40,000-plus users apprised of real-time data updates," says Jansen Price, Principal Software Architect. "The reliability and speed of Cloud Firestore coupled with its real-time capabilities allowed us to deliver a great product for the Google Cloud Next conferences."

Reliable information delivery

Incident response company Now IMS uses real-time data to keep citizens safe in crowded places, where cell service can get spotty when demand is high. "As an incident management company, real-time and offline capabilities are paramount to our customers," says John Rodkey, Co-founder. "Cloud Firestore, along with the Firebase Javascript SDK, provides us with these capabilities out of the box.
This new 100% serverless architecture on Google Cloud enables us to focus on rapid application development to meet our customers' needs instead of worrying about infrastructure or server management like with our previous cloud."

Regardless of the app, users want the latest information right away, without having to click refresh. The QuintoAndar mobile application connects tenants and landlords in Brazil for easier apartment rentals. "Being able to deliver constantly changing information to our customers allows us to provide a truly engaging experience. Cloud Firestore enables us to do this without additional infrastructure and allows us to focus on the core challenges of our business," says Guilherme Salerno, Engineering Manager at QuintoAndar.

Real-time, responsive apps, happy users

Famed broadsheet and media company The Telegraph uses Cloud Firestore so registered users can easily discover and engage with relevant content. The Telegraph wanted to make the user experience better without having to become infrastructure experts in serving and managing data to millions of concurrent connections. "Cloud Firestore allowed us to build a real-time personalized news feed, keeping readers informed with synchronized content state across all of their devices," says Alex Mansfield-Scaddan, Solution Architect. "It allowed The Telegraph engineering teams to focus on improving engagement with our customers, rather than becoming real-time database and infrastructure experts."

On the other side of the Atlantic, The New York Times used Cloud Firestore to build a feature in The Times' mobile app to send push notifications updated in real time for the 2018 Winter Olympics. In previous approaches to this feature, scaling had been a challenge. The team needed to track each reader's history of interactions in order to provide tailored content for particular events or sports. Cloud Firestore allowed them to query data dynamically, then send the real-time updates to readers. The team was able to send more targeted content faster.

Delivering powerful edge storage for IoT devices

Athlete testing technology company Hawkin Dynamics was an early, pre-beta adopter of Cloud Firestore. Their pressure pads are used by many professional sports teams to measure and track athlete performance. In the fast-paced, high-stakes world of professional sports, athletes can't wait around for devices to connect or results to calculate; they demand instant answers even if the WiFi is temporarily down. Hawkin Dynamics uses Cloud Firestore to bring real-time data to athletes through their app dashboard, shown below.

"Our core mission at Hawkin Dynamics is to help coaches make informed decisions regarding their athletes through the use of actionable data. With real-time updates, our users can get the data they need to adjust an athlete's training on a moment-by-moment basis," says Chris Wales, CTO. "By utilizing the powerful querying ability of Cloud Firestore, we can provide them the insights they need to evaluate the overall efficacy of their programs. The close integrations with Cloud Functions and the other Firebase products have allowed us to constantly improve on our product and stay responsive to our customers' needs.
In an industry that is rapidly changing, the flexibility afforded to us by Cloud Firestore in extending our applications has allowed us to stay ahead of the game."

Getting started with Cloud Firestore

We've heard from many of you that Cloud Firestore is helping solve some of your most timely development challenges by simplifying real-time data and data synchronization, eliminating server-side code, and providing flexible yet secure database authentication rules. This reflects the state of the cloud app market, where developers are exploring lots of options to help them build better and faster while also providing modern user experiences. This glance at Stack Overflow questions gives a good picture of some of these trends, where Cloud Firestore is a hot topic among cloud databases.

(Chart source: StackExchange)

We've seen close to a million Cloud Firestore databases created since its beta launch. The platform is designed to serve databases ranging in size from kilobytes to multiple petabytes of data. Even a single application running on Cloud Firestore is delivering more than 1 million real-time updates per second to users. These apps are just the beginning. To learn more about serverless application development, take a look through the archive of the recent application development digital conference.

We'd love to hear from you, and we can't wait to see what you build next. Try Cloud Firestore today for your apps.
Source: Google Cloud Platform

How we built a derivatives exchange with BigQuery ML for Google Next ‘18

Financial institutions have a natural desire to predict the volume, volatility, value, or other parameters of financial instruments or their derivatives, to manage positions and mitigate risk more effectively. They also have a rich set of business problems (and correspondingly large datasets) to which it's practical to apply machine learning techniques.

Typically, though, in order to start using ML, financial institutions must first hire data scientist talent with ML expertise—a skill set for which recruiting competition is high. In many cases, an organization has to undertake the challenge and expense of bootstrapping an entire data science practice. This summer, we announced BigQuery ML, a set of machine learning extensions on top of our scalable data warehouse and analytics platform. BigQuery ML effectively democratizes ML by exposing it via the familiar interface of SQL—thereby letting financial institutions accelerate their productivity and maximize existing talent pools.

As we got ready for Google Cloud Next London last summer, we decided to build a demo to showcase BigQuery ML's potential for the financial services community. In this blog post, we'll walk through how we designed the system, selected our time-series data, built an architecture to analyze six months of historical data, and quickly trained a model to outperform a "random guess" benchmark—all while making predictions in close to real time.

Meet the Derivatives Exchange

A team of Google Cloud solution architects and customer engineers built the Derivatives Exchange in the form of an interactive game, in which you can opt to either rely on luck, or use predictions from a model running in BigQuery ML, to decide which options contracts will expire in-the-money. Instead of using the value of financial instruments as the "underlying" for the options contracts, we used the volume of Twitter posts (tweets) for a particular hashtag within a specific timeframe. Our goal was to show the ease with which you can deploy machine learning models on Google Cloud to predict an instrument's volume, volatility, or value.

The Exchange demo, as seen at Google Next '18 London

Our primary goal was to translate an existing and complex trading prediction process into a simple illustration to which users from a variety of industries can relate. Thus, we decided to:

- Use the very same Google Cloud products that our customers use daily.
- Present a time-series that is familiar to everyone—in this case, the number of hashtag tweets observed in a 10-minute window as the "underlying" for our derivative contracts.
- Build a fun, educational, and inclusive experience.

When designing the contract terms, we used this Twitter time-series data in a manner similar to the strike levels specified in weather derivatives.

Architectural decisions

Solution architecture diagram: the social media options market

We imagined the exchange as a retail trading pit where, using mobile handsets, participants purchase European binary range call option contracts across various social media single names (what most people would think of as hashtags). Contracts are issued every ten minutes and expire after ten minutes. At expiry, the count of accumulated #hashtag mentions for the preceding window is used to determine which participants were holding in-the-money contracts, and their account balances are updated accordingly. Premiums are collected upon opening interest in a contract, and are refunded if the contract strikes in-the-money. All contracts pay out 1:1.
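As a toy illustration of those contract mechanics (this is not the exchange's actual code, and the payout interpretation, 1:1 winnings plus the refunded premium, is our reading of the rules above):

    // Toy settlement sketch in the job server's language (node.js).
    // A position holds a strike range [lo, hi] and the premium paid.
    function settleWindow(finalTweetCount, positions) {
      return positions.map(p => {
        const inTheMoney =
          finalTweetCount >= p.lo && finalTweetCount <= p.hi;
        return {
          ...p,
          inTheMoney,
          // Winning contracts get their premium back plus a 1:1 payout.
          credit: inTheMoney ? 2 * p.premium : 0,
        };
      });
    }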
We chose the following Google Cloud products to implement the demo:

Compute Engine served as our job server. The implementation executes periodic tasks for issuing, expiring, and settling contracts. The design also requires a singleton process to run as a daemon to continually ingest tweets into BigQuery. We decided to consolidate these compute tasks into an ephemeral virtual machine on Compute Engine. The job server tasks were authored with node.js and shell scripts, using cron jobs for scheduling, and configured by an instance template with embedded VM startup scripts, for flexibility of deployment. The job server does not interact with any traders on the system, but populates the "market operational database" with both participant and contract status.

Cloud Firestore served as our market operational database. Cloud Firestore is a document-oriented database that we use to store information on market sessions. It serves as a natural destination for the tweet count and open interest data displayed by the UI, and enables seamless integration with the front end.

Firebase and App Engine provided our mobile and web applications. Using the Firebase SDK for both our mobile and web applications' interfaces enabled us to maintain a streamlined codebase for the front end. Some UI components (such as the leaderboard and market status) need continual updates to reflect changes in the source data (like when a participant's interest in a contract expires in-the-money). The Firebase SDK provides concise abstractions for developers and enables front-end components to be bound to Cloud Firestore documents, and therefore to update automatically whenever the source data changes. Choosing App Engine to host the front-end application allowed us to focus on UI development without the distractions of server management or configuration deployment. This helped the team rapidly produce an engaging front end.

Cloud Functions ran our backend API services. The UI needs to save trades to Cloud Firestore, and Cloud Functions facilitates this serverlessly. This serverless backend means we can focus on development logic, rather than server configuration or schema definitions, thereby significantly reducing the length of our development iterations.

BigQuery and BigQuery ML stored and analyzed tweets. BigQuery solves so many diverse problems that it can be easy to forget how many aspects of this project it enables. First, it reliably and economically ingests and stores volumes of streaming Twitter data at scale, with minimal integration effort. The daemon process code for ingesting tweets consists of 83 lines of Javascript, with only 19 of those lines pertaining to BigQuery. Next, it lets us extract features and labels from the ingested data, using standard SQL syntax. Most importantly, it brings ML capabilities to the data itself with BigQuery ML, allowing us to train a model on features extracted from the data, ultimately exposing predictions at runtime by querying the model with standard SQL.

BigQuery ML can help solve two significant problems that the financial services community faces daily. First, it brings predictive modeling capabilities to the data, sparing the cost, time, and regulatory risk associated with migrating sensitive data to external predictive models. Second, it allows these models to be developed using common SQL syntax, empowering data analysts to make predictions and develop statistical insights.
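The 83-line ingestion daemon mentioned above isn't reproduced in the post, but the BigQuery portion of such a process might look roughly like the sketch below; the dataset, table, and field names are assumptions for illustration.

    // Sketch of streaming a tweet observation into BigQuery (node.js).
    const { BigQuery } = require('@google-cloud/bigquery');
    const bigquery = new BigQuery();

    // Called once per matching tweet observed on the Twitter feed.
    async function recordTweet(hashtag) {
      await bigquery
        .dataset('exchange')   // hypothetical dataset name
        .table('tweets')       // hypothetical table name
        .insert([{ hashtag, created_at: new Date().toISOString() }]);
    }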
At Next '18 London, one attendee in the pit observed that the tool fills an important gap between data analysts, who might have deep familiarity with their particular domain's data but less familiarity with statistics, and data scientists, who possess expertise around machine learning but may be unfamiliar with the particular problem domain. We believe BigQuery ML helps address a significant talent shortage in financial services by blending these two distinct roles into one.

Structuring and modeling the data

Our model training approach is as follows:

First, persist raw data in the simplest form possible: filter the Twitter Enterprise API feed for tweets containing specific hashtags (pulled from a pre-defined subset), and persist a two-column time-series consisting of the specific hashtag as well as the timestamp of that tweet as it was observed in the Twitter feed.

Second, define a view in SQL that sits atop the main time-series table and extracts features from the raw Twitter data (a rough sketch of such a view follows after this walkthrough). We selected features that allow the model to predict the number of tweet occurrences for a given hashtag within the next 10-minute period. Specifically:

- Hashtag. #fintech may have behaviors distinct from #blockchain and distinct from #brexit, so the model should be aware of this as a feature.
- Day of week. Sunday's tweet behaviors will be different from Thursday's tweet behaviors.
- Specific intraday window. We sliced a 24-hour day into 144 10-minute segments, so the model can inform us on trend differences between various parts of the 24-hour cycle.
- Average tweet count from the past hour. These values are calculated by the view based upon the primary time-series data.
- Average tweet velocity from the past hour. To predict future tweet counts accurately, the model should know how active the hashtag has been in the prior hour, and whether that activity was smooth (say, 100 tweets consistently for each of the last six 10-minute windows) or bursty (say, five 10-minute windows with 0 tweets followed by one window with 600 tweets).
- Tweet count range. This is our label, the final output value that the model will predict. The contract issuance process running on the job server contains logic for issuing options contracts with strike ranges for each hashtag and 10-minute window (Range 1: 0-100, Range 2: 101-250, etc.). We took the large historical Twitter dataset and, using the same logic, stamped each example with a label indicating the range that would have been in-the-money. Just as equity option chains issued on a stock are informed by the specific stock's price history, our exchange's option chains are informed by the underlying hashtag's volume history.

Third, train the model on this SQL view. BigQuery ML makes model training an incredibly accessible exercise. While remaining inside the data warehouse, we use a SQL statement to declare that we want to create a model trained on a particular view containing the source data, using a particular column as the label.

Finally, deploy the trained model in production. Again using SQL, simply query the model based on certain input parameters, just as you would query any table.
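The post's actual view SQL appears later as screenshots (Examples 3 and 4), so here is only a minimal sketch of this kind of feature extraction, under assumed table and column names, omitting the velocity feature and the label for brevity:

    -- Sketch only: table, view, and column names are assumptions.
    CREATE OR REPLACE VIEW `exchange.tweet_features` AS
    WITH windows AS (
      SELECT
        hashtag,
        -- Bucket each tweet into its 10-minute window.
        TIMESTAMP_SECONDS(600 * DIV(UNIX_SECONDS(created_at), 600)) AS window_start,
        COUNT(*) AS tweet_count
      FROM `exchange.tweets`
      GROUP BY hashtag, window_start
    )
    SELECT
      hashtag,
      EXTRACT(DAYOFWEEK FROM window_start) AS day_of_week,
      -- Which of the day's 144 10-minute segments this window falls in.
      DIV(EXTRACT(HOUR FROM window_start) * 60
          + EXTRACT(MINUTE FROM window_start), 10) AS intraday_window,
      -- Average count over the six preceding windows (the past hour).
      AVG(tweet_count) OVER (
        PARTITION BY hashtag
        ORDER BY window_start
        ROWS BETWEEN 6 PRECEDING AND 1 PRECEDING) AS avg_count_past_hour,
      tweet_count
    FROM windows;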
Trading options contracts

To make the experience engaging, we wanted to recreate a bit of the open-outcry pit experience by having multiple large "market data" screens for attendees (the trading crowd) to track contract and participant performance. Demo participants used Pixel 2 handsets in the pit to place orders using a simple UI, from which they could allocate their credits to any or all of the three hashtags. When placing their order, they chose between relying on their own forecast, or using the predictions of a BigQuery ML model for their specific options portfolio, among the list of contracts currently trading in the market. Once the trades were made for their particular contracts, they monitored how their trades performed compared to other "traders" in real time, then saw how accurate the respective predictions were when the trading window closed at expiration time (every 10 minutes).

ML training process

In order to easily generate useful predictions about tweet volumes, we use a three-part process. First, we store tweet time-series data in a BigQuery table. Second, we layer views on top of this table to extract the features and labels required for model training. Finally, we use BigQuery ML to train, and get predictions from, the model.

The canonical list of hashtags to be counted is stored within a BigQuery table named "hashtags". This is joined with the "tweets" table to determine aggregates for each time window.

Example 1: Schema definition for the "hashtags" table

1. Store tweet time-series data

The tweet listener writes tags, timestamps, and other metadata to a BigQuery table named "tweets" that has the schema listed in Example 2:

Example 2: Schema definition for the "tweets" table

2. Extract features via layered views

The lowest-level view calculates the count of each hashtag's occurrence, per intraday window. The mid-level view extracts the features mentioned in the section above ("Structuring and modeling the data"). The top-level view then extracts the label (i.e., the "would-have-been in-the-money" strike range) from that time-series data.

a. Lowest-level view

The lowest-level view is defined by the SQL in Example 3. The view definition contains logic to aggregate tweet history into 10-minute buckets (144 of these buckets per 24-hour day) by hashtag.

Example 3: low-level view definition

b. Intermediate view

The selection of some features (for example: hashtag, day-of-week, or specific intraday window) is straightforward, while others (such as average tweet count and velocity for the past hour) are more complex. The SQL in Example 4 illustrates these more complex feature selections.

Example 4: intermediate view definition for adding features

c. Highest-level view

Having selected all necessary features in the prior view, it's time to select the label. The label should be the strike range that would have been in-the-money for a given historical hashtag and ten-minute window. The application's "Contract Issuance" batch job generates strike ranges for every 10-minute window, and its "Expiration and Settlement" job determines which contract (range) struck in-the-money. When labeling historical examples for model training, it's critical to apply this exact same application logic.

Example 5: highest-level view

3. Train and get predictions from the model

Having created a view containing our features and label, we refer to the view in our BigQuery ML model creation statement:

Example 6: model creation

Then, at the time of contract issuance, we execute a query against the model to retrieve a prediction as to which contract will be in-the-money.

Example 7: SELECTing predictions FROM the model
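Examples 6 and 7 were published as screenshots, so the statements below are only a rough reconstruction under the same assumed names as the earlier sketch; in particular, the model type is an assumption, since the post doesn't say which one was used.

    -- Sketch only: model, view, and column names are assumptions.
    CREATE OR REPLACE MODEL `exchange.range_model`
    OPTIONS (model_type = 'logistic_reg',
             input_label_cols = ['label']) AS
    SELECT hashtag, day_of_week, intraday_window,
           avg_count_past_hour, avg_velocity_past_hour, label
    FROM `exchange.training_view`;

    -- At contract issuance, ask the model which strike range it
    -- expects to finish in-the-money for the upcoming window.
    SELECT hashtag, predicted_label
    FROM ML.PREDICT(
      MODEL `exchange.range_model`,
      (SELECT hashtag, day_of_week, intraday_window,
              avg_count_past_hour, avg_velocity_past_hour
       FROM `exchange.current_window_features`));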
Improvements

The exchange was built with a relatively short lead time, hence there were several architectural and tactical simplifications made in order to realistically ship on schedule. Future iterations of the exchange will look to implement several enhancements, such as:

- Introduce Cloud Pub/Sub into the architecture. Cloud Pub/Sub is an enabler for refined data pipeline architectures, and it stands to improve several areas within the exchange's solution architecture. For example, it would reduce the latency of reported tweet counts by allowing the requisite components to be event-driven rather than batch-oriented.

- Replace VM `cron` jobs with Cloud Scheduler. The current architecture relies on Linux `cron`, running on a Compute Engine instance, for issuing and expiring options contracts, which contributes to the net administrative footprint of the solution. Launched in November of last year (after the version 1 architecture had been deployed), Cloud Scheduler will enable the team to provide comparable functionality with less infrastructural overhead.

- Reduce the size of the code base by leveraging Dataflow templates. Often, solutions contain non-trivial amounts of code responsible for simply moving data from one place to another, like persisting Pub/Sub messages to BigQuery. Cloud Dataflow templates allow development teams to shed these non-differentiating lines of code from their applications and simply configure and manage specific pipelines for many common use cases.

- Expand the stored attributes of ingested tweets. Storing the geographical tweet origins and the actual texts of ingested tweets could provide a richer basis from which future contracts may be defined. For example, sentiment analysis could be performed on the tweet contents for particular hashtags, thus allowing binary contracts to be issued pertaining to the overall sentiment on a topic.

- Consider BigQuery user-defined functions (UDFs) to eliminate duplicate code among batch jobs and model execution. Certain functionality, such as the ability to nimbly deal with time in 10-minute slices, is required by multiple pillars of the architecture, and resulted in the team deploying duplicate algorithms in both SQL and Javascript. With BigQuery UDFs, the team can author the algorithm once, in Javascript, and leverage the same code assets in both the Javascript batch processes as well as in the BigQuery ML models (a sketch of this idea follows below).
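As a sketch of that last idea, assuming hypothetical dataset and column names, a JavaScript UDF shared with the SQL side might look like:

    -- Sketch only: a JavaScript UDF mapping a timestamp to its
    -- 10-minute window index (0-143), so the same algorithm can be
    -- reused by the node.js batch jobs.
    CREATE TEMP FUNCTION intraday_window(ts TIMESTAMP)
    RETURNS INT64
    LANGUAGE js AS """
      return Math.floor((ts.getUTCHours() * 60 + ts.getUTCMinutes()) / 10);
    """;

    SELECT hashtag, intraday_window(created_at) AS window_idx
    FROM `exchange.tweets`;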
A screenshot of the exchange dashboard during a trading session

If you're interested in learning more about BigQuery ML, check out our documentation, or more broadly, have a look at our solutions for the financial services industry, or check out this interactive BigQuery ML walkthrough video. Or, if you're able to attend Google Next '19 in San Francisco, you can even try out the exchange for yourself.

Source: Google Cloud Platform