How we built a derivatives exchange with BigQuery ML for Google Next ‘18

Financial institutions have a natural desire to predict the volume, volatility, value or other parameters of financial instruments or their derivatives, in order to manage positions and mitigate risk more effectively. They also have a rich set of business problems (and correspondingly large datasets) to which it’s practical to apply machine learning techniques. Typically, though, in order to start using ML, financial institutions must first hire data scientist talent with ML expertise—a skill set for which recruiting competition is high. In many cases, an organization has to undertake the challenge and expense of bootstrapping an entire data science practice.

This summer, we announced BigQuery ML, a set of machine learning extensions on top of our scalable data warehouse and analytics platform. BigQuery ML effectively democratizes ML by exposing it via the familiar interface of SQL—thereby letting financial institutions accelerate their productivity and maximize existing talent pools.

As we got ready for Google Cloud Next London last summer, we decided to build a demo that showcases BigQuery ML’s potential for the financial services community. In this blog post, we’ll walk through how we designed the system, selected our time-series data, built an architecture to analyze six months of historical data, and quickly trained a model to outperform a ‘random guess’ benchmark—all while making predictions in close to real time.

Meet the Derivatives Exchange

A team of Google Cloud solution architects and customer engineers built the Derivatives Exchange in the form of an interactive game, in which you can opt to either rely on luck, or use predictions from a model running in BigQuery ML, to decide which options contracts will expire in-the-money. Instead of using the value of financial instruments as the “underlying” for the options contracts, we used the volume of Twitter posts (tweets) for a particular hashtag within a specific timeframe. Our goal was to show the ease with which you can deploy machine learning models on Google Cloud to predict an instrument’s volume, volatility, or value.

The Exchange demo, as seen at Google Next ‘18 London

Our primary goal was to translate an existing and complex trading prediction process into a simple illustration to which users from a variety of industries can relate. Thus, we decided to:

Use the very same Google Cloud products that our customers use daily.
Present a time-series that is familiar to everyone—in this case, the number of hashtag tweets observed in a 10-minute window as the “underlying” for our derivative contracts.
Build a fun, educational, and inclusive experience.

When designing the contract terms, we used this Twitter time-series data in a manner similar to the strike levels specified in weather derivatives.

Architectural decisions

Solution architecture diagram: the social media options market

We imagined the exchange as a retail trading pit where, using mobile handsets, participants purchase European binary range call option contracts across various social media single names (what most people would think of as hashtags). Contracts are issued every ten minutes and expire after ten minutes. At expiry, the count of accumulated #hashtag mentions for the preceding window is used to determine which participants were holding in-the-money contracts, and their account balances are updated accordingly. Premiums are collected upon opening interest in a contract, and are refunded if the contract strikes in-the-money.
All contracts pay out 1:1.

We chose the following Google Cloud products to implement the demo:

Compute Engine served as our job server
The implementation executes periodic tasks for issuing, expiring, and settling contracts. The design also requires a singleton process to run as a daemon to continually ingest tweets into BigQuery. We decided to consolidate these compute tasks into an ephemeral virtual machine on Compute Engine. The job server tasks were authored with Node.js and shell scripts, using cron jobs for scheduling, and configured by an instance template with embedded VM startup scripts, for flexibility of deployment. The job server does not interact with any traders on the system, but populates the “market operational database” with both participant and contract status.

Cloud Firestore served as our market operational database
Cloud Firestore is a document-oriented database that we use to store information on market sessions. It serves as a natural destination for the tweet count and open interest data displayed by the UI, and enables seamless integration with the front end.

Firebase and App Engine provided our mobile and web applications
Using the Firebase SDK for both our mobile and web applications’ interfaces enabled us to maintain a streamlined codebase for the front end. Some UI components (such as the leaderboard and market status) need continual updates to reflect changes in the source data (like when a participant’s interest in a contract expires in-the-money). The Firebase SDK provides concise abstractions for developers and enables front-end components to be bound to Cloud Firestore documents, and therefore to update automatically whenever the source data changes.

Choosing App Engine to host the front-end application allowed us to focus on UI development without the distractions of server management or configuration deployment. This helped the team rapidly produce an engaging front end.

Cloud Functions ran our backend API services
The UI needs to save trades to Cloud Firestore, and Cloud Functions facilitate this serverlessly. This serverless backend means we can focus on development logic, rather than server configuration or schema definitions, thereby significantly reducing the length of our development iterations.

BigQuery and BigQuery ML stored and analyzed tweets
BigQuery solves so many diverse problems that it can be easy to forget how many aspects of this project it enables. First, it reliably and economically ingests and stores streaming Twitter data at scale, with minimal integration effort. The daemon process code for ingesting tweets consists of 83 lines of JavaScript, with only 19 of those lines pertaining to BigQuery. Next, it lets us extract features and labels from the ingested data, using standard SQL syntax. Most importantly, it brings ML capabilities to the data itself with BigQuery ML, allowing us to train a model on features extracted from the data, ultimately exposing predictions at runtime by querying the model with standard SQL.

BigQuery ML can help solve two significant problems that the financial services community faces daily. First, it brings predictive modeling capabilities to the data, sparing the cost, time and regulatory risk associated with migrating sensitive data to external predictive models. Second, it allows these models to be developed using common SQL syntax, empowering data analysts to make predictions and develop statistical insights.
At Next ‘18 London, one attendee in the pit observed that the tool fills an important gap between data analysts, who might have deep familiarity with their particular domain’s data but less familiarity with statistics, and data scientists, who possess expertise around machine learning but may be unfamiliar with the particular problem domain. We believe BigQuery ML helps address a significant talent shortage in financial services by blending these two distinct roles into one.

Structuring and modeling the data

Our model training approach is as follows:

First, persist raw data in the simplest form possible: filter the Twitter Enterprise API feed for tweets containing specific hashtags (pulled from a pre-defined subset), and persist a two-column time series consisting of the specific hashtag and the timestamp at which that tweet was observed in the Twitter feed.

Second, define a view in SQL that sits atop the main time-series table and extracts features from the raw Twitter data (a sketch of such a view appears after these steps). We selected features that allow the model to predict the number of tweet occurrences for a given hashtag within the next 10-minute period. Specifically:

Hashtag. #fintech may have behaviors distinct from #blockchain and distinct from #brexit, so the model should be aware of this as a feature.
Day of week. Sunday’s tweet behaviors will be different from Thursday’s tweet behaviors.
Specific intra-day window. We sliced a 24-hour day into 144 10-minute segments, so the model can inform us on trend differences between various parts of the 24-hour cycle.
Average tweet count from the past hour. These values are calculated by the view based upon the primary time-series data.
Average tweet velocity from the past hour. To predict future tweet counts accurately, the model should know how active the hashtag has been in the prior hour, and whether that activity was smooth (say, 100 tweets consistently for each of the last six 10-minute windows) or bursty (say, five 10-minute windows with 0 tweets followed by one window with 600 tweets).
Tweet count range. This is our label, the final output value that the model will predict. The contract issuance process running on the job server contains logic for issuing options contracts with strike ranges for each hashtag and 10-minute window (Range 1: 0-100, Range 2: 101-250, etc.). We took the large historical Twitter dataset and, using the same logic, stamped each example with a label indicating the range that would have been in-the-money. Just as equity option chains issued on a stock are informed by the specific stock’s price history, our exchange’s option chains are informed by the underlying hashtag’s volume history.

Third, train the model on this SQL view. BigQuery ML makes model training an incredibly accessible exercise. While remaining inside the data warehouse, we use a SQL statement to declare that we want to create a model trained on a particular view containing the source data, using a particular column as the label.

Finally, deploy the trained model in production. Again using SQL, simply query the model based on certain input parameters, just as you would query any table.
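The SQL behind this view is not reproduced in this text, so here is a minimal, hedged sketch of the kind of windowed feature extraction described in step two, run through the BigQuery Python client. The dataset, table, and column names are illustrative assumptions, and the velocity calculation is deliberately simplified relative to the real demo.

```python
# Hedged sketch: a feature-extraction view over a per-window tweet-count
# table. Dataset, table, and column names are illustrative assumptions;
# window_start is assumed to be a TIMESTAMP marking each 10-minute bucket.
from google.cloud import bigquery

client = bigquery.Client()

create_view = """
CREATE OR REPLACE VIEW `tweet_options.window_features` AS
WITH deltas AS (
  SELECT
    hashtag,
    window_start,
    tweet_count,
    tweet_count - LAG(tweet_count) OVER (
      PARTITION BY hashtag ORDER BY window_start) AS count_delta
  FROM `tweet_options.tweet_counts_per_window`
)
SELECT
  hashtag,
  window_start,
  EXTRACT(DAYOFWEEK FROM window_start) AS day_of_week,
  DIV(EXTRACT(HOUR FROM window_start) * 60
      + EXTRACT(MINUTE FROM window_start), 10) AS intraday_window,  -- 0..143
  -- Average count over the six preceding 10-minute windows (the past hour).
  AVG(tweet_count) OVER (
    PARTITION BY hashtag ORDER BY window_start
    ROWS BETWEEN 6 PRECEDING AND 1 PRECEDING) AS avg_count_last_hour,
  -- A crude "velocity": window-to-window change, averaged over the past hour.
  AVG(count_delta) OVER (
    PARTITION BY hashtag ORDER BY window_start
    ROWS BETWEEN 6 PRECEDING AND 1 PRECEDING) AS avg_velocity_last_hour
FROM deltas
"""
client.query(create_view).result()  # DDL runs like any other query
```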
Trading options contracts

To make the experience engaging, we wanted to recreate a bit of the open-outcry pit experience by having multiple large “market data” screens for attendees (the trading crowd) to track contract and participant performance. Demo participants used Pixel 2 handsets in the pit to place orders using a simple UI, from which they could allocate their credits to any or all of the three hashtags. When placing their order, they chose between relying on their own forecast, or using the predictions of a BigQuery ML model for their specific options portfolio, among the list of contracts currently trading in the market. Once the trades were made for their particular contracts, they monitored how their trades performed compared to other “traders” in real time, then saw how accurate the respective predictions were when the trading window closed at expiration time (every 10 minutes).

ML training process

To generate useful predictions about tweet volumes, we use a three-part process. First, we store tweet time-series data in a BigQuery table. Second, we layer views on top of this table to extract the features and labels required for model training. Finally, we use BigQuery ML to train the model and get predictions from it.

The canonical list of hashtags to be counted is stored within a BigQuery table named “hashtags”. This is joined with the “tweets” table to determine aggregates for each time window.

Example 1: Schema definition for the “hashtags” table

1. Store tweet time-series data

The tweet listener writes tags, timestamps, and other metadata to a BigQuery table named “tweets” that possesses the schema listed in Example 2:

Example 2: Schema definition for the “tweets” table

2. Extract features via layered views

The lowest-level view calculates the count of each hashtag’s occurrence, per intraday window. The mid-level view extracts the features mentioned in the above section (“Structuring and modeling the data”). The top-level view then extracts the label (i.e., the “would-have-been in-the-money” strike range) from that time-series data.

a. Lowest-level view
The lowest-level view is defined by the SQL in Example 3. The view definition contains logic to aggregate tweet history into 10-minute buckets (with 144 of these buckets per 24-hour day) by hashtag.

Example 3: low-level view definition

b. Intermediate view
The selection of some features (for example: hashtag, day-of-week or specific intraday window) is straightforward, while others (such as average tweet count and velocity for the past hour) are more complex. The SQL in Example 4 illustrates these more complex feature selections.

Example 4: intermediate view definition for adding features

c. Highest-level view
Having selected all necessary features in the prior view, it’s time to select the label. The label should be the strike range that would have been in-the-money for a given historical hashtag and ten-minute window. The application’s “Contract Issuance” batch job generates strike ranges for every 10-minute window, and its “Expiration and Settlement” job determines which contract (range) struck in-the-money. When labeling historical examples for model training, it’s critical to apply this exact same application logic.

Example 5: highest-level view

3. Train and get predictions from the model

Having created a view containing our features and label, we refer to the view in our BigQuery ML model creation statement:

Example 6: model creation

Then, at the time of contract issuance, we execute a query against the model to retrieve a prediction as to which contract will be in-the-money (a sketch of both statements appears below).

Example 7: SELECTing predictions FROM the model
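The listings for Examples 6 and 7 are not reproduced in this text. As a hedged stand-in, here is a minimal sketch of what such statements could look like, run through the BigQuery Python client; the dataset, view, model, and column names are illustrative assumptions rather than the demo’s actual objects.

```python
# Hedged sketch of Examples 6 and 7: train a BigQuery ML classifier on the
# feature/label view, then query it for predictions. All object and column
# names are illustrative assumptions.
from google.cloud import bigquery

client = bigquery.Client()

# Example 6 (sketch): create a model over the highest-level view.
create_model = """
CREATE OR REPLACE MODEL `tweet_options.range_predictor`
OPTIONS (model_type = 'logistic_reg',
         input_label_cols = ['strike_range']) AS
SELECT hashtag, day_of_week, intraday_window,
       avg_count_last_hour, avg_velocity_last_hour, strike_range
FROM `tweet_options.training_examples`
"""
client.query(create_model).result()  # blocks until training finishes

# Example 7 (sketch): at contract issuance, ask the model which strike
# range it expects to finish in-the-money for the upcoming window.
predict = """
SELECT hashtag, predicted_strike_range
FROM ML.PREDICT(
  MODEL `tweet_options.range_predictor`,
  (SELECT hashtag, day_of_week, intraday_window,
          avg_count_last_hour, avg_velocity_last_hour
   FROM `tweet_options.current_window_features`))
"""
for row in client.query(predict).result():
    print(row.hashtag, row.predicted_strike_range)
```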
Improvements

The exchange was built with a relatively short lead time, so several architectural and tactical simplifications were made in order to realistically ship on schedule. Future iterations of the exchange will look to implement several enhancements, such as:

Introduce Cloud Pub/Sub into the architecture. Cloud Pub/Sub is an enabler for refined data pipeline architectures, and it stands to improve several areas within the exchange’s solution architecture. For example, it would reduce the latency of reported tweet counts by allowing the requisite components to be event-driven rather than batch-oriented.

Replace VM `cron` jobs with Cloud Scheduler. The current architecture relies on Linux `cron`, running on a Compute Engine instance, for issuing and expiring options contracts, which adds to the net administrative footprint of the solution. Launched in November of last year (after the version 1 architecture had been deployed), Cloud Scheduler will enable the team to provide comparable functionality with less infrastructural overhead.

Reduce the size of the code base by leveraging Dataflow templates. Often, solutions contain non-trivial amounts of code responsible for simply moving data from one place to another, like persisting Pub/Sub messages to BigQuery. Cloud Dataflow templates allow development teams to shed these non-differentiating lines of code from their applications and simply configure and manage specific pipelines for many common use cases.

Expand the stored attributes of ingested tweets. Storing the geographical tweet origins and the actual texts of ingested tweets could provide a richer basis from which future contracts may be defined. For example, sentiment analysis could be performed on the tweet contents for particular hashtags, thus allowing binary contracts to be issued pertaining to the overall sentiment on a topic.

Consider BigQuery user-defined functions (UDFs) to eliminate duplicate code among batch jobs and model executions. Certain functionality, such as the ability to nimbly deal with time in 10-minute slices, is required by multiple pillars of the architecture, which resulted in the team deploying duplicate algorithms in both SQL and JavaScript. With BigQuery UDFs, the team can author the algorithm once, in JavaScript, and leverage the same code assets in both the JavaScript batch processes and the BigQuery ML models.

A screenshot of the exchange dashboard during a trading session

If you’re interested in learning more about BigQuery ML, check out our documentation, or more broadly, have a look at our solutions for the financial services industry, or check out this interactive BigQuery ML walkthrough video. Or, if you’re able to attend Google Next ‘19 in San Francisco, you can even try out the exchange for yourself.
Source: Google Cloud Platform

AI in Depth: Cloud Dataproc meets TensorFlow on YARN: Let TonY help you train right in your cluster

Apache Hadoop has become an established and long-running framework for distributed storage and data processing. Google’s Cloud Dataproc is a fast, easy-to-use, fully managed cloud service for running Apache Spark and Apache Hadoop clusters in a simple, cost-efficient way. With Cloud Dataproc, you can set up a distributed storage platform without worrying about the underlying infrastructure. But what if you want to train TensorFlow workloads directly on your distributed data store?

This post will explain how to set up a Hadoop cluster for the LinkedIn open-source project TonY (TensorFlow on YARN). You will deploy a Hadoop cluster using Cloud Dataproc and use TonY to launch a distributed machine learning job. We’ll explore how you can use two of the most popular machine learning frameworks: TensorFlow and PyTorch.

TensorFlow supports distributed training, allowing portions of the model’s graph to be computed on different nodes. This distributed property can be used to split up computation to run on multiple servers in parallel. Orchestrating distributed TensorFlow is not a trivial task and not something that all data scientists and machine learning engineers have the expertise, or desire, to do—particularly since it must be done manually. TonY provides a flexible and sustainable way to bridge the gap between the analytics powers of distributed TensorFlow and the scaling powers of Hadoop. With TonY, you no longer need to configure your cluster specification manually, a task that can be tedious, especially for large clusters.

The components of our system:

First, Apache Hadoop
Apache Hadoop is an open source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. Hadoop services provide data storage, data processing, data access, data governance, security, and operations.

Next, Cloud Dataproc
Cloud Dataproc is a managed Spark and Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. Cloud Dataproc’s automation capability helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don’t need them. With less time and money spent on administration, you can focus on your jobs and your data.

And now TonY
TonY is a framework that enables you to natively run deep learning jobs on Apache Hadoop. It currently supports TensorFlow and PyTorch. TonY enables running either single-node or distributed training as a Hadoop application. This native connector, together with other TonY features, runs machine learning jobs reliably and flexibly.

Installation

Set up a Google Cloud Platform project
Get started on Google Cloud Platform (GCP) by creating a new project, using the instructions found here.

Create a Cloud Storage bucket
Then create a Cloud Storage bucket. Reference here.

Create a Hadoop cluster via Cloud Dataproc using initialization actions
You can create your Hadoop cluster directly from the Cloud Console or via an appropriate `gcloud` command. The following command initializes a cluster that consists of 1 master and 2 workers (a hedged sketch is shown below). When you create the Cloud Dataproc cluster, you specify the TonY initialization action script, which Cloud Dataproc runs on all nodes of the cluster immediately after the cluster is set up.
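The exact command was not preserved in this text. As a hedged stand-in, here is a sketch that drives the same kind of `gcloud` invocation from a small Python script; the cluster name, region, and initialization-action URI are placeholders and assumptions, not the values used in the original post.

```python
# Hedged sketch: create a Dataproc cluster (one master by default, 2 workers)
# with a TonY initialization action, by invoking the gcloud CLI from Python.
# cluster_name, region, and init_action are placeholders; substitute the
# initialization-action URI published for TonY.
import subprocess

cluster_name = "tony-cluster"                    # assumption
region = "us-central1"                           # assumption
init_action = "gs://YOUR_BUCKET/tony/tony.sh"    # placeholder URI

subprocess.run(
    [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        "--region", region,
        "--num-workers", "2",
        "--image-version", "1.3-deb9",            # Hadoop 2.9.0, per the post
        "--initialization-actions", init_action,
    ],
    check=True,  # raise if the gcloud command fails
)
```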
Note: Use Cloud Dataproc version 1.3-deb9, which is supported for this deployment. Cloud Dataproc version 1.3-deb9 provides Hadoop version 2.9.0. Check this version list for details.

Once your cluster is created, you can verify under Cloud Console > Big Data > Cloud Dataproc > Clusters that cluster installation is complete and your cluster’s status is Running.

Go to Cloud Console > Big Data > Cloud Dataproc > Clusters and select your new cluster. You will see the Master and Worker nodes.

Connect to your Cloud Dataproc master server via SSH
Click on SSH and connect remotely to the Master server.

Verify that your YARN nodes are active
Example

Installing TonY
TonY’s Cloud Dataproc initialization action will do the following:
Install and build TonY from the GitHub repository.
Create a sample folder containing TonY examples for the following frameworks:
TensorFlow
PyTorch

The following folders are created:
TonY install folder (TONY_INSTALL_FOLDER) is located by default in:
TonY samples folder (TONY_SAMPLES_FOLDER) is located by default in:

The TonY samples folder provides two examples for running distributed machine learning jobs:
TensorFlow MNIST example
PyTorch MNIST example

Running a TensorFlow distributed job

Launch a TensorFlow training job
You will be launching the Dataproc job using a `gcloud` command. The following folder structure was created during installation in `TONY_SAMPLES_FOLDER`, where you will find a sample Python script to run the distributed TensorFlow job. This is a basic MNIST model, but it serves as a good example of using TonY with distributed TensorFlow. This MNIST example uses “data parallelism,” by which you use the same model in every device, using different training samples to train the model in each device. There are many ways to specify this structure in TensorFlow, but in this case, we use “between-graph replication” using tf.train.replica_device_setter.

Dependencies
TensorFlow version 1.9
Note: If you require a more recent TensorFlow and TensorBoard version, take a look at the progress of this issue to be able to upgrade to the latest TensorFlow version.

Connect to Cloud Shell
Open Cloud Shell via the console UI. Use the following gcloud command to create a new job. Once launched, you can monitor the job. (See the section below on where to find the job monitoring dashboard in Cloud Console.)

Running a PyTorch distributed job

Launch your PyTorch training job
For PyTorch as well, you can launch your Cloud Dataproc job using a `gcloud` command. The following folder structure was created during installation in the TONY_SAMPLES_FOLDER, where you will find a sample script to run the distributed PyTorch job:

Dependencies
PyTorch version 0.4
Torch Vision 0.2.1

Launch a PyTorch training job

Verify your job is running successfully
You can track job status from the Dataproc Jobs tab: navigate to Cloud Console > Big Data > Dataproc > Jobs.

Access your Hadoop UI
Log in via the web to Cloud Dataproc’s master node at http://<Node_IP>:8088 to track job status. Please take a look at this section to see how to access the Cloud Dataproc UI.

Cleanup resources
Delete your Cloud Dataproc cluster.

Conclusion
Deploying TensorFlow on YARN enables you to train models straight from your data infrastructure that lives in HDFS and Cloud Storage. If you’d like to learn more about some of the related topics mentioned in this post, feel free to check out the following documentation links:
Machine Learning with TensorFlow on GCP
Hyperparameter tuning on GCP
How to train ML models using GCP

Acknowledgements: Anthony Hsu, LinkedIn Software Engineer; and Zhe Zhang, LinkedIn Core Big Data Infra team manager.
Source: Google Cloud Platform

Bring your own KDC and enable Kerberos authentication in Amazon EMR

With Amazon EMR release 5.20.0 and later, you can now use an external Kerberos KDC to authenticate the applications and users running on your EMR cluster. With this feature, you can connect multiple Kerberized EMR clusters to a central, external KDC and allow applications within those clusters to use Kerberos for mutual authentication, without having to establish a cross-realm trust. This is particularly useful when several clusters need to authenticate against a central data lake cluster to access data for job submission and execution. You can also set up a cross-realm trust between an external KDC and an Active Directory domain, either on-premises or in Amazon EC2. This lets users in your corporate directory access all Kerberized EMR clusters that authenticate against that KDC more securely, using their familiar Active Directory domain credentials.
For more information about configuring and using an external KDC with EMR, see Using Kerberos Authentication and External KDC Architecture Options in the Amazon EMR Management Guide.
This feature is available now in all regions where Amazon EMR is supported.
Source: aws.amazon.com

Ansible solution now available in the Azure Marketplace

Last year, we shared the release of a great developer experience through Ansible in Azure Cloud Shell, and the Ansible extension for Visual Studio Code. Today we’re excited to be expanding our support of Ansible on Azure with a fully configured Azure Marketplace Ansible solution.

As you may know from interacting with Microsoft Azure services, we have delivered a suite of Ansible cloud modules that help you automate provisioning and orchestrate your infrastructure on Azure. Using the Ansible cloud modules for Azure requires authenticating with the Azure API. This Ansible solution enables teams to use Ansible with managed identities for Azure resources, formerly known as Managed Service Identity (MSI). This means you can use the identity to authenticate to any service that supports Azure Active Directory authentication, without any credentials in your code or environment variables (a sketch of this credential-free pattern appears below).
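As a rough illustration of the same idea, here is a hedged sketch using the Azure Python SDK rather than the Ansible modules themselves: when the code runs on an Azure VM that has a managed identity with rights on the subscription, no secret appears in code or environment variables. The subscription ID and resource group name are placeholders.

```python
# Hedged sketch: authenticate with the VM's managed identity instead of
# credentials stored in code or environment variables. Run this on an Azure
# VM whose managed identity has been granted access to the subscription.
from azure.identity import ManagedIdentityCredential
from azure.mgmt.resource import ResourceManagementClient

credential = ManagedIdentityCredential()          # token comes from the VM's identity
client = ResourceManagementClient(credential, "<subscription-id>")  # placeholder

# Create a resource group without any secret in the script.
client.resource_groups.create_or_update(
    "ansible-demo-rg",                # placeholder name
    {"location": "westeurope"},
)
```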

Furthermore, this solution template also permits you to select an Ansible release later than version 2.5.0 (such as 2.7.5), according to your needs. By default, the latest release version is used. It also enables the Azure CLI. These features allow you to use a consistent, hosted instance of Ansible for cloud configuration and management, including production scenarios.

You can search for Ansible in the Azure portal or go to the Azure Marketplace and select Get it now to create a hosted instance of Ansible. You can also find a three-minute QuickStart that provides a step-by-step walkthrough.

Ansible solution and Ansible extension – Develop locally, run in Azure

It’s common to write Ansible playbooks in a local environment but run them on a fully configured Ansible host. We added secure shell support to the Ansible extension to let Ansible developers copy Ansible playbooks, or the whole workspace, and run them on the remote Ansible host. Now Ansible developers can use the Ansible extension in Visual Studio Code, a free code editor that runs on macOS, Linux, and Windows, to develop Ansible playbooks on any platform and run them on a fully configured Ansible host created by the Ansible solution.

You need to grant the virtual machine access to the subscription used to connect to the Ansible host from your local environment.

We are excited about the improved developer experience we are creating for Ansible on Azure. Go ahead and try the Ansible solution. For more information, visit the Ansible on Azure developer hub.
Source: Azure

QnA Maker simplifies knowledge base management for your Q&A bot

This post was co-authored by the QnA Maker Team.

With Microsoft Bot Framework, you can build chatbots and conversational applications in a variety of ways, whether you’re looking to develop a bot from scratch with the open source Bot Framework, create your own branded assistant with the Virtual Assistant solution accelerator, or create a Q&A bot in minutes with QnA Maker. QnA Maker is an easy-to-use, web-based service that powers a question-and-answer application or chatbot from semi-structured content like FAQ documents and product manuals. With QnA Maker, developers can build, train, and publish question and answer bots in minutes.

Today, we are excited to announce the launch of a highly requested feature, Active Learning in QnA Maker. Active Learning helps identify and recommend question variations for any question and allows you to add them to your knowledge base. Your knowledge base content won’t change unless you choose to add or edit the suggestions.

How it works

Active Learning is triggered based on the scores of the top N answers returned by QnA Maker for any given query. If the score differences lie within a small range, then the query is considered a possible “suggestion” for each of the possible answers. The exact score-difference threshold is a function of the confidence score of the top answer.

All the suggestions are then clustered together by similarity, and the top suggestions for alternate questions are displayed based on how frequently end users issue the particular queries. Therefore, active learning gives the best possible suggestions in cases where the endpoints are getting a reasonable quantity and variety of usage queries.
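The following is a hedged sketch of the triggering idea described above, not QnA Maker’s actual implementation: when the top-ranked answers score too closely, the user query is recorded as a candidate question variation for each of them. The band value and record shapes are illustrative assumptions.

```python
# Hedged sketch: collect suggestion candidates when top answers score closely.
from collections import defaultdict

def collect_suggestions(query, ranked_answers, band=0.05):
    """ranked_answers: list of (answer_id, confidence) sorted by confidence."""
    if len(ranked_answers) < 2:
        return {}
    _, top_score = ranked_answers[0]
    suggestions = defaultdict(list)
    for answer_id, score in ranked_answers:
        # Scores within a small band of the top answer are ambiguous.
        if top_score - score <= band:
            suggestions[answer_id].append(query)
    # Only treat the query as a suggestion when more than one answer is close.
    return dict(suggestions) if len(suggestions) > 1 else {}

# Two answers score nearly the same, so the query becomes a candidate
# variation for both; a later frequency/similarity pass would surface the
# most common candidates to the knowledge base designer.
print(collect_suggestions("how do I reset my password",
                          [("qna-12", 0.81), ("qna-47", 0.79), ("qna-3", 0.40)]))
```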

QnA Maker learns new question variations in two possible ways.

Implicit feedback – The ranker understands when a user question has multiple answers with scores which are very close and considers that as implicit feedback.
Explicit feedback – When multiple answers with little variation in scores are returned from the knowledge base, the client application can ask the user which question is the correct question. When the user selects the correct question, the user's explicit feedback is sent to QnA Maker with the Train API.

Either method provides the ranker with similar queries that are clustered. When similar queries are clustered, QnA Maker surfaces these user-submitted questions to the knowledge base designer to accept or reject.
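As a hedged sketch of the explicit-feedback path, here is what a call to the Train API could look like using the requests library. The endpoint path and payload field names follow the QnA Maker runtime documentation as we recall it; treat them as assumptions and verify against the current API reference. The host, key, and knowledge base ID are placeholders.

```python
# Hedged sketch: send explicit feedback (the question the user picked) to the
# QnA Maker Train API. Path and field names are assumptions to verify.
import requests

RUNTIME_HOST = "https://your-qnamaker-app.azurewebsites.net"  # placeholder
ENDPOINT_KEY = "<endpoint-key>"                               # placeholder
KB_ID = "<knowledge-base-id>"                                 # placeholder

feedback = {
    "feedbackRecords": [
        {
            "userId": "user-123",
            "userQuestion": "how do I reset my password",
            "qnaId": 12,   # the QnA pair the user confirmed as correct
        }
    ]
}

resp = requests.post(
    f"{RUNTIME_HOST}/qnamaker/knowledgebases/{KB_ID}/train",
    headers={"Authorization": f"EndpointKey {ENDPOINT_KEY}"},
    json=feedback,
    timeout=10,
)
resp.raise_for_status()
```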

How to turn on active learning

By default, Active Learning is disabled. Follow the steps below to enable it.

1. To turn active learning on, go to your Service Settings in the QnA Maker portal, in the top-right corner.

 

2. Find the QnA Maker service then toggle Active Learning.

Once Active Learning is enabled, the knowledge base suggests new questions at regular intervals based on user-submitted questions. You can disable Active Learning by toggling the setting again.

How to add Active Learning suggestion to the knowledge base

1. In order to see the suggested questions, on the Edit knowledge base page, select Show Suggestions.

2. Filter the knowledge base with question and answer pairs to only show suggestions by selecting Filter by Suggestions.

3. Each question section with suggestions shows the new questions with a check mark to accept the question or an x mark to reject the suggestion. Click on the check mark to add the question.

You can accept or reject all suggestions at once by selecting Add all or Reject all.

4. Select Save and Train to save the changes to the knowledge base.

To use Active Learning effectively, your bot should have reasonably high traffic: the higher the number of end-user queries, the better the quality and quantity of suggestions.

QnA Maker active learning Dialog

The QnA Maker active learning Dialog does the following:

Get the top N matches from the QnA service for every query above the configured threshold.
If the top result’s confidence score is significantly higher than the rest of the results, show only the top answer.
If the top N results have similar confidence scores, prompt the user to clarify which of the returned questions they meant.
Once the user selects the question that matches their intent, show the answer for that question.
This selection also sends feedback to the QnA Maker service via the Train API.

Migrating knowledge bases from the old preview portal 

You may recall that at //Build in May 2018, we announced the general availability (GA) of QnA Maker with a new architecture built on Azure. As a result, knowledge bases created with the QnA Maker free preview will need to be migrated to QnA Maker GA, as the preview will be deprecated on January 31, 2019. Learn how to migrate existing knowledge bases in the documentation, “Migrate a knowledge base using export-import.”

Below is a screenshot of the old QnA Maker preview portal for reference:

For more information about the changes in QnA Maker GA, see the QnA Maker GA announcement blog post, “Announcing General Availability of QnAMaker.”

QnA Maker GA highlights:

New architecture. The data and runtime components of the QnAMaker stack will be hosted in the user’s Azure subscription. Learn more on the documentation, “What is QnA Maker?”
No more throttling. Pay for services hosted, instead of transactions. See pricing information.
Data privacy and compliance. The QnA data will be hosted within your Azure compliance boundary.
Brand new portal experience to create and manage your knowledge base. Check out the new portal.
Scale as you go. Scale different parts of the stack as per your needs. See upgrading your QnA Maker service.

Source: Azure

Data analytics, meet containers: Kubernetes Operator for Apache Spark now in beta

Many organizations run Apache Spark, a widely used data analytics engine for large-scale data processing, and are also eager to use Kubernetes and associated tools like kubectl. Today, we’re announcing the beta launch of the Kubernetes Operator for Apache Spark (referred to as the Spark Operator, for short, from here on), which helps you easily manage your Spark applications natively from Kubernetes. It is available today in the GCP Marketplace for Kubernetes.

Traditionally, large-scale data processing workloads—Spark jobs included—run on dedicated software stacks such as YARN or Mesos. With the rising popularity of microservices and containers, organizations have demonstrated a need for first-class support for data processing and machine learning workloads in Kubernetes. One of the most important community efforts in this area is native Kubernetes integration, available in Spark since version 2.3.0. The diagram below illustrates how this integration works on a Kubernetes cluster.

The Kubernetes Operator for Apache Spark runs, monitors and manages the lifecycle of Spark applications, leveraging this native Kubernetes integration. Specifically, the operator is a Kubernetes custom controller that uses custom resources for declarative specification of Spark applications. The controller offers fine-grained lifecycle management of Spark applications, including support for automatic restart using a configurable restart policy, and for running cron-based, scheduled applications. It provides improved elasticity and integration with Kubernetes services such as logging and monitoring. With the Spark Operator, you can create a declarative specification that describes your Spark applications and use native Kubernetes tooling such as kubectl to manage your applications (see the sketch below). As a result, you now have a common control plane for managing different kinds of workloads on Kubernetes, simplifying management and improving your cluster’s resource utilization.

With this launch, the Spark Operator for Apache Spark is ready for use for large-scale data transformation, analytics, and machine learning on Google Cloud Platform (GCP). It supports the new and improved Apache Spark 2.4, letting you run PySpark and SparkR applications on Kubernetes. It also includes many enhancements and fixes that improve its reliability and observability, and it can be easily installed with Helm.
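As a rough illustration of that declarative model, here is a hedged sketch that submits a SparkApplication custom resource through the Kubernetes Python client. The spec fields mirror the operator’s public examples, but the API version, image, and jar path shown are assumptions that may differ from your installation.

```python
# Hedged sketch: create a SparkApplication custom resource from Python.
# The group/version, image, and jar path are assumptions based on the
# operator's public examples; adjust them to match your installation.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a pod

spark_pi = {
    "apiVersion": "sparkoperator.k8s.io/v1beta1",
    "kind": "SparkApplication",
    "metadata": {"name": "spark-pi", "namespace": "default"},
    "spec": {
        "type": "Scala",
        "mode": "cluster",
        "image": "gcr.io/spark-operator/spark:v2.4.0",  # assumption
        "mainClass": "org.apache.spark.examples.SparkPi",
        "mainApplicationFile":
            "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar",
        "sparkVersion": "2.4.0",
        "restartPolicy": {"type": "Never"},
        "driver": {"cores": 1, "memory": "512m", "serviceAccount": "spark"},
        "executor": {"cores": 1, "instances": 2, "memory": "512m"},
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="sparkoperator.k8s.io",
    version="v1beta1",
    namespace="default",
    plural="sparkapplications",
    body=spark_pi,
)
# Afterwards the application can be inspected with native tooling, e.g.:
#   kubectl get sparkapplications spark-pi -o yaml
```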
Support for Spark 2.4

Spark 2.4, released in October last year, features improved Kubernetes integration. First, it now supports Python and R Spark applications with Docker images tailored to the language bindings. Second, it provides support for client mode, allowing interactive applications such as the Spark Shell and data science tools like Jupyter and Apache Zeppelin notebooks to run computations natively on Kubernetes. Then, there’s support for certain types of Kubernetes data volumes. Combined with other enhancements and fixes that make native Kubernetes integration more reliable and usable, our Spark Operator is a cloud-native solution that makes it easy to run and manage Spark applications on Kubernetes.

Integration with GCP

The Operator integrates with various GCP products and services, including Stackdriver for logging and monitoring, and Cloud Storage and BigQuery for storage and analytics. Specifically, the operator exposes application-level metrics in the Prometheus data format and automatically configures Spark applications to expose driver- and executor-level metrics to Prometheus. With a Prometheus server and the Stackdriver sidecar, your cluster can automatically collect metrics and send them to Stackdriver Monitoring. Application driver and executor logs are automatically collected and pushed to Stackdriver when the application runs on Google Kubernetes Engine (GKE).

The Spark Operator also includes a command-line tool named sparkctl that automatically detects an application’s dependencies on the user’s client machine and uploads them to a Cloud Storage bucket. It then substitutes the client-local dependencies in the application specification with the ones stored in Cloud Storage, greatly simplifying the use of client-local application dependencies in a Kubernetes environment.

The Spark Operator for Apache Spark ships with a custom Spark 2.4 Dockerfile that supports using Cloud Storage for input or output data in an application. This Dockerfile also includes the Prometheus JMX exporter, which exposes Spark metrics in the Prometheus data format and is the Spark Operator’s default approach for configuring an application when Prometheus monitoring is enabled.

The GCP Marketplace: a one-stop shop

GCP Marketplace for Kubernetes is a one-stop shop for major Kubernetes applications. The Spark Operator is available for quick installation on the Marketplace, including logging, monitoring, and integration with other GCP services out of the gate.

An active community

The Spark Operator for Apache Spark has an active community of contributors and users. The Spark Operator is currently deployed and used by several organizations for machine learning and analytics use cases, and has a dedicated Slack channel with over 170 members that engage in active daily discussions. Its GitHub repository has commits from over 20 contributors from a variety of organizations and has close to 300 stars—here’s a shout-out to all those who have made the project what it is today!

Looking forward

With a growing community around the project, we are constantly working on ideas and plans to improve it. Going into 2019, we’re working on the following features with the community:

Running and managing applications of different Spark versions with the native Kubernetes integration. Currently, an Operator version only supports a specific Spark version. For example, an Operator version that is compatible with Spark 2.4 cannot be used to run Spark 2.3.x applications.
Priority queues and basic priority-based scheduling. This will make the Operator better suited for running production batch processing workloads, e.g., ETL pipelines.
Kerberos authentication, starting with Spark 3.0.
Using a Kubernetes Pod template to configure the Spark driver and/or executor Pods.
Enhancements to the Operator’s sparkctl command-line tool, for example, making it a kubectl plugin that you can manage with tools like krew.

If you are interested in trying out the Kubernetes Operator, please install it directly from the GCP Marketplace, check out the documentation, and let us know if you have any questions, feedback, or issues.
Source: Google Cloud Platform

Tune up your SLI metrics: CRE life lessons

The site reliability engineering (SRE) we practice here at Google comes to life in our customer reliability engineering (CRE) teams and practices. We support customers during their daily processes, as well as during peak events where traffic and customer expectations are high.

CRE comes with some practical metrics to quantify service levels and expectations, namely SLOs, SLIs and SLAs. In previous CRE Life Lessons, we’ve talked about the importance of Service Level Indicators (SLIs) to measure an approximation of your customers’ experience. In this post, we’ll look at how you can tune your existing SLIs to be a better representation of what your customers are experiencing.

If you’re just getting started with your own SRE practice, it’s important to remember that almost any SLI is better than no SLI. Putting numbers and concrete goals out there focuses the conversation between different parts of your org, even if you don’t use your fledgling SLIs to do things like page oncallers or freeze releases. Quantifying customer happiness metrics is usually a team journey; pretty much no one gets it right the first time.

SLIs can help you understand and improve customer experience with your site or services. The cleaner your SLIs are, and the better they correlate with end-user problems, the more directly useful they will be to you. The ideal SLI to strive for (and perhaps never reach) is a near-real-time metric expressed as a percentage, which varies from 0%—all your customers are having a terrible time—to 100%, where all your customers feel your site is working perfectly.

Once you have defined an SLI, you need to find the right target level for it. When the SLI is above the target, your customers are generally happy, and when the SLI is below the target, your customers are generally unhappy. This is the level for your Service Level Objective (SLO). As we have discussed in previous CRE Life Lessons, SLOs are the primary mechanism to balance reliability and innovation, and so improving the quality of your SLO lets you judge this balance better, guiding the strategy of your application development and operation.

There’s also a tactical purpose for SLIs: once you have a target level for them, you can start to use it to drive your response to outages. If your measured SLI is too far below your target level for too long, you have to assume that your customers are having a bad time and you need to start doing something about it. If you’re a bit below target, your on-call engineer might start investigating when they get into the office. If you’re critically below target, to the point where you’re in danger of overspending your error budget, then you should send a pager alert and get their immediate attention.

But all this presumes that your SLIs represent customer experience. How do you know whether that’s true in the first place? Perhaps you’re wasting effort measuring and optimizing metrics that aren’t really that important? Let’s get into the details of how you can tune SLIs to better represent customer experience.

In search of customer happiness

For the purposes of this blog post, let’s assume that you’ve defined—and measured—an initial set of SLIs for your service, but you don’t know yet what your SLO targets should be, or even whether your SLIs really do represent your customers’ experience.
Let’s look at how to find out whether your SLIs need a tune-up. A good SLI will exhibit the following properties:

It rises when your customers become happier;
It falls when your customers become less happy;
It shows materially different measurements during an outage as compared to normal operations;
It oscillates within a narrow band (i.e., showing a low variance) during normal operations.

The fourth property is fairly easy to observe in isolation, but to calibrate the other properties you need additional insights into your customers’ happiness. The real question is whether they think your service is having an outage.

To get these “happiness” measurements, you might gauge:

The rate at which customers get what they want from your site, such as commercial transaction completion rates (for an e-commerce application) or stream concurrent/total views (for streaming video);
Support center complaints;
Online support forum post rates; and perhaps even
Twitter mentions.

Compile this information to build a picture of when your customers were experiencing pain. Start simple: you don’t need a “system” to do this; start with a spreadsheet lining up your SLIs’ data with, say, support center calls. This exercise is invaluable to cross-check your SLIs against “reality”. A correlation view might look like the following overlay on your SLI graph, where pink bars denote customer pain events:

This example shows that your SLI is quite good but not perfect; big drops tend to correlate with customer pain, but there is an undetected pain event with no SLI drop and a couple of SLI drops without known pain.

Testing customer SLIs during peak events

One good way to test your SLI against your customers’ experience is with data from a compelling event—a period of a few days when your service comes under unusually heavy scrutiny and load. For retailers, the Black Friday/Cyber Monday weekend at the end of November and Singles’ Day on November 11 are examples of this. During these events, customers are more likely to tell you—directly or indirectly—when they’re unhappy. They’re racing to take advantage of all the superb deals your site is offering, and if they can’t do so, they’ll tell you. In addition to those who complain, many more will have left your site in silent displeasure.

Or suppose that your company’s service lets customers stream live sports matches. They care about reliability all the time, but really care about reliability during specific events such as the FIFA World Cup final, because nearly everyone is watching one specific stream and is hanging on every moment of the game. If they miss seeing the critical goal in the World Cup final because your streaming service died, they are going to be quite unhappy and they’re going to let you know about it.

Be aware, however, that this isn’t the whole story; it tells you what “really bad” looks like to your customers, but there is generally not enough data to determine what “annoying” looks like. For the latter, you need to look at a much longer operational period for your service.

Analyzing SLI data

Suppose you had a reasonably successful day, with no customer-perceived outages that you know about. So your customer happiness metric was flat, but your 24-hour SLI view looks like the diagram below. This presents you with a problem. You have three periods in the day where your failure rate is much higher than normal, for a significant amount of time (30-plus minutes).
Your customers aren’t complaining, so either you’re not detecting customer complaints, or your SLI is showing errors that customers aren’t seeing. (See below for what to do in this situation.)

On the other hand, if you see the following with the same lack of customer complaint history:

Then this suggests you might have quite a good SLI. There are a couple of transient events that might relate to glitches in monitoring or infrastructure, but overall the SLI is low in volatility.

Now let’s consider a day where you know you had a problem: Early in the day, you know that your users started to get very unhappy because your support call volume went through the roof around 9am, and stayed that way for several hours before tailing off. You look at your SLI for the day and you see the following:

The SLI clearly supports the customer feedback in its general shape. You see a couple of hours when the error rate was more than 10x normal, a very brief recovery, another couple of hours of high error rate, then a gradual recovery. That’s a strong signal that the SLI is correlated with the happiness of your users.

The most interesting point to investigate is the opening of the incident: Your SLI shows you a transient spike in errors (“First spike” in the diagram), followed by a sustained high error rate. How well did this capture when real users started to see problems? For this, you probably need to trawl through your server and client logs for the time in question. Is this when you start to see a dip in indicators such as transactions completed? Remember that support calls or forum posts will inevitably lag user-visible errors by many minutes—though sometimes you will get lucky and people will write things like “about 10 minutes ago I started seeing errors.”

At least as important: Consider how your failure rate scales compared to the percentage of users who can complete their transactions. Your SLI might be showing a 25% error rate, but that could make your system effectively unusable if users have to complete three actions in a series for a transaction to complete. That would mean the probability of success is 75% x 75% x 75%, i.e., only 42%, and so nearly 60% of transaction attempts would fail. (That is, the failure rate is cumulative because users can’t attempt action two before action one is complete.)
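To make that arithmetic concrete, here is a tiny, hedged sketch; the three-step flow and the 25% per-step error rate are simply the illustrative numbers from the paragraph above.

```python
# Hedged sketch of the arithmetic above: a per-step error rate compounds
# across a multi-step user journey, so the end-to-end failure rate is much
# worse than any single step's SLI suggests.
def end_to_end_failure_rate(per_step_error_rate: float, steps: int) -> float:
    per_step_success = 1.0 - per_step_error_rate
    return 1.0 - per_step_success ** steps

# Three sequential actions, each with a 25% error rate:
rate = end_to_end_failure_rate(0.25, steps=3)
print(f"{rate:.0%} of transactions fail end to end")  # -> 58%
```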
My customers are unhappy but my SLIs are fine

If this is your situation, have you considered acquiring customers who are more aligned with your existing SLIs?

We’re kidding, of course—this turns out to be quite challenging, not to mention bad for business, and you may well end up believing that you have a real gap in your SLI monitoring. In this case, you need to go back and do an analysis of the event in question: Are there existing (non-SLI) monitoring signals that show the user impact starting to happen, such as a drop in completed payments on your site? If so, can you derive an additional SLI from them? If not, what else do you have to measure to be able to detect this problem?

At Google, we have a specific tag, “Customer found it first,” that we apply when we’re carrying out our postmortems. This denotes a situation when the monitoring signals driving engineer response didn’t clearly indicate a problem before the customers noticed, and no engineer could prove that an alert would have fired. Postmortems with this tag, where a critical mass of customers noticed the problem first, should always have action items addressing this gap. That may mean either expanding the existing set of SLIs for the service to cover these situations, or tightening the thresholds of existing SLOs so that the violation gets flagged earlier.

But wait: You already have some signals for “customer happiness,” as we’ve noted above. Why don’t you use one of them, such as support center call rates, as one of your SLIs? Well, the principle we recommend is that you should use reports of customers’ dissatisfaction as a signal to calibrate your SLIs, but ultimately you should rely on SLIs rather than customer reports to know how well your product behaves.

Generally, the decision about using these as SLIs comes down to a combination of signal resolution and time lag. Any signal which relies on human action inevitably introduces lag; a human has to notice that there’s a problem, decide that it’s severe enough to act on, then find the right web page to report it. That will add up to tens of minutes before you start getting enough of a signal to alert your on-call. In addition, unless you have a huge user base, the very small fraction of users who care enough about your service to report problems will mean that user reports are an intrinsically noisy signal, and might well come from an unrepresentative subset of your users.

There’s also the option of measuring your SLIs elsewhere. For instance, if you have a mobile application, you might measure your customers’ experience there and push that data back to your service. This can improve the quality of your SLI measurement, which we’d strongly encourage, providing a clearer picture of how bad an incident was and how much error budget it spent. However, that can cause your data to take longer to arrive in your monitoring system than when you measure your service performance directly, so don’t rely on this as a timely source of alerts.

My SLIs are unhappy but my customers are fine

We often think of this situation as one of “polluted” SLIs. As well as transient spikes of failures that users don’t notice, your SLI might show periodic peaks of errors from when a batch system sends a daily (or hourly) flurry of queries to your service, which do not have the same importance as queries from your customers. They might just be retried later if they fail, or maybe the data is not time-sensitive. Because you aren’t distinguishing errors served to your internal systems from errors served to your customers, you can’t determine whether the batch traffic is impacting your customer experience; the batch traffic is hiding what could otherwise be a perfectly good signal.

When you later try to define an SLO in this unhappy situation, and set up alerts for when you spend too much of your error budget, you’ll be forced to either spend a significant amount of error budget during each peak, or set the SLO to be so loose that real users will endure a significant amount of pain before the error budget spend starts to show any impact.

So you should figure out a way to exclude these batch queries from your SLI: they don’t deserve the same SLO as queries from real customers.
Possible approaches include:

Configure the batch queries to hit a different endpoint;
Measure the SLI at a different point in the architecture where the batch queries don’t appear;
Replace the SLI by measuring something at a higher level of abstraction (such as a complete user transaction);
Tag each query with “batch,” “interactive,” or another category, and break down SLI reporting by these tags; or
Post-process the logs to remove the batch queries from your accounting.

In the above example, the problem is that the SLI doesn’t really describe user harm from the point of view of the batch system; we don’t know what the consequence of returning 10% errors to the batch queries might be. So we should remove the batch queries from the regular SLI accounting, and investigate whether there’s a better high-level SLI to represent the batch user experience, such as “percentage of financial reports published by their due date”.

Everything is awesome (for now)

Congratulations! You’ve reviewed your SLIs and customer happiness, and things seem to match up. No, this is not the end. So far in this blog post we’ve focused on critical events as a driver for improving your service’s SLIs, but really this is an activity that you should be doing periodically for your service.

We’d recommend that at least every few months you conduct a review of user happiness signals against your measured SLIs, and look in particular for times that users were more unhappy than usual but your SLIs didn’t really move. That’s a sign that you need to find a new SLI, or improve the quality of an existing SLI, in order to detect this unhappiness.

Managing multiple SLIs

You might have several SLIs defined for your service, but you know that your users were having a bad time between 10am and 11:30am, and only one of your SLIs was out of bounds during that time. Is that a problem?

It’s not a problem per se; you expect different SLIs to measure different aspects of the customer experience. That’s why you have them. At some point, there will be broad problems with your systems where all your SLIs plummet below their targets. An overloaded backend application might cause elevated levels of your latency SLI, but most transactions are still completing, so your availability SLI might show nearly normal levels despite a significant number of customers experiencing pain.

Still, you should expect each SLI to correspond to some kind of user-visible outage. If you’ve had several months of operations with several user-facing outages, and there’s one SLI which doesn’t budge from its normal range at all, you have to ask yourself why you’re monitoring it, alerting on it and acting on its variations. You might well have a good reason for hanging on to it—to capture a kind of problem that you haven’t seen yet, for example—but make sure that it’s not an emotional attachment. Every SLI is a tax on your team’s attention. Make sure it’s a tax worth paying.

However, remember that nearly any SLI is better than no SLI. As long as you are aware of the limitations of the SLIs you’re using, they give you valuable information in detecting and evaluating user-facing outages.

In a future post we’ll talk about the next step in complexity—deciding how to combine these SLIs into a useful SLO. (But we’ll save that for Future Us.) In the meantime, you can start learning about SRE SLOs in our Coursera course.

Thanks to Alec Warner, Alex Bramley, Anton Tolchanov, Dave Rensin, David Ferguson, Gustavo Franco, Kristina Bennett, and Myk Taylor, among others, for their contributions to this post.
Source: Google Cloud Platform

Extending Stackdriver to on-prem with the new BindPlane integration

We introduced our partnership with Blue Medora last year, and explained in a blog post how it extends Stackdriver’s capabilities. We’re pleased to announce that you can now sign up for our new Blue Medora offering. If you’re using Stackdriver to monitor your Google Cloud Platform (GCP) or Amazon Web Services (AWS) resources, you can now extend your observability to on-prem infrastructure, Microsoft Azure, databases, hardware devices and more. The recently released BindPlane integration from Blue Medora lets you consolidate all your signals into Stackdriver, GCP’s monitoring tool. This integration connects health and performance signals from a wide variety of sources. Stackdriver and BindPlane together bring an in-depth, hybrid and multi-cloud view into one dashboard.

In this post, we’ll show you how to get started adding the BindPlane dimensional data stream into Stackdriver. Questions or want to learn more? Sign up here and we’ll be in touch.

Before you get started, you’ll need to have a GCP billing account and project already set up. Learn more here about setting up or modifying a billing account or creating a project.

Here’s how to get started with BindPlane:

Visit the BindPlane page in the Google Cloud Marketplace
BindPlane is free of charge to Stackdriver customers, but you must activate your service.

Find BindPlane in the Google Cloud Marketplace
From the BindPlane marketplace listing, click the “Start with the free plan” button.

The BindPlane listing in the Google Cloud Marketplace

Select your BindPlane plan
On step 1, Subscribe, confirm that the “free” plan is selected from the drop-down menu.
Assign your preferred GCP billing account and click “Subscribe”.
On step 2, Activate, click “Register with Blue Medora”.

Confirm your BindPlane plan from the Google Cloud Marketplace before activating your account.

Create your BindPlane account
From the BindPlane sign up page:
Link the GCP project you created to BindPlane by using the project name in the “Company Name” field.
Create your account using an email and password.
Accept the end-user licensing agreement.
Click “Sign up”.

Create your BindPlane account using your GCP project name as the company name

Sign in to BindPlane
Once you’ve done that, close the registration window and return to the BindPlane marketplace listing. Click the “Manage API keys on Blue Medora website” link to sign in and begin the configuration process.

Install the smart collector
BindPlane’s intelligent collectors reside inside your network and send data back to Stackdriver. Unlike an agent, the BindPlane collector automatically updates as new versions become available.

You’ll want to install the collector somewhere that has network access to the sources you’re planning to monitor. Don’t worry if you have multiple isolated networks; you can add as many collectors as needed for each network.

While you can install a collector on your source’s host, we recommend installing it on its own VM. This limits your configuration effort and allows you to have one collector that monitors multiple sources or services.

BindPlane configuration starts with a collector

From the BindPlane Getting Started dashboard, click “Add First Collector”.
Select the operating system on which your collector will be running.
Copy the installation command. BindPlane provides a single-line command to install the collector on your system.
Just copy the command and run it on your local server to get started. If you get stuck, check out the Blue Medora documentation for additional details on the collector requirements, installation process, how to set up a proxy, and how to test your connection.

Success! Your new collector is up and running.

Configure a source to monitor
A source is any object you’d like to monitor. It could be a database, a web service, or even a hardware device in your data center. BindPlane currently includes more than 150 integration sources and is adding more all the time.

From the BindPlane Getting Started screen or the collector success message, click “Add First Source.”

The BindPlane source catalog continues to expand each month.

Choose one source type from the BindPlane catalog. You can come back later and add others.
Select a source type you know is available on the same network or region as your collector.
When prompted, select your collector.
Enter your credentials.
Click “Test Connection” to verify everything’s working correctly.
Click “Add” to begin monitoring.

Setting up a PostgreSQL source for monitoring

The configuration process can vary slightly from source to source, so visit the BindPlane source documentation if you need more details.

Connecting your data to Stackdriver
A destination is a monitoring analytics service like Stackdriver where you can view your collected data. Stackdriver customers are currently the only users who can access BindPlane’s full feature set without any licensing fees.

In order to configure a Google Stackdriver destination, you will need to create an IAM service account with the monitoring admin role in GCP. For more information on this process, see IAM Service Account. Once you’ve done that, download the private key JSON file associated with that service account. (See this documentation on creating and managing service keys.)

BindPlane connects to many destination platforms, but only Stackdriver customers have access to the service free of charge.

Select “Google Stackdriver” as your destination type.
Name your destination as desired.
Paste your JSON key into the “Application Credentials” field.
Click “Test Connection” to verify everything’s working correctly.
Click “Add” to stream your data to Stackdriver.

Configuring Stackdriver as your destination platform requires you to use the JSON key from your monitoring admin IAM account.

Find your data in Stackdriver
Your BindPlane data moves into Stackdriver through the Google Cloud Custom Metric API. Within Stackdriver, all metrics will be associated with the Global Monitored Resource type. Use the Metrics Explorer to quickly locate a specific metric, as shown below. The namespace of each metric will be formatted as /{integration}/{resource}/{metric}.

That’s it! You’ve now connected Blue Medora’s BindPlane to Stackdriver, so you can visualize and set up alerts on every metric in your environment. Ready to try it yourself? Get started now in the Google Cloud Marketplace. Questions or want to learn more? Sign up here and we’ll be in touch.
Source: Google Cloud Platform