New observability features for your Splunk Dataflow streaming pipelines

We’re thrilled to announce several new observability features for the Pub/Sub to Splunk Dataflow template to help operators keep tabs on their streaming pipeline performance. Splunk Enterprise and Splunk Cloud customers use the Splunk Dataflow template to reliably export Google Cloud logs for in-depth analytics for security, IT, or business use cases. With newly added metrics and improved logging for the Splunk IO sink, it’s now easier to answer operational questions such as:

- Is the Dataflow pipeline keeping up with the volume of logs generated?
- What is the latency and throughput (events per second, or EPS) when writing to Splunk?
- What is the response status breakdown of the downstream Splunk HTTP Event Collector (HEC), and what are the potential error messages?

This critical visibility helps you derive your log export service-level indicators (SLIs) and monitor for any pipeline performance regressions. You can also more easily root-cause potential downstream failures between Dataflow and Splunk, such as Splunk HEC network connection or server issues, and fix the problem before it cascades. To help you quickly chart these new metrics, we’ve included them in the custom dashboard that is part of the updated Terraform module for Splunk Dataflow. You can use those Terraform templates to deploy the entire infrastructure for log export to Splunk, or just the Monitoring dashboard alone.

[Image: Log Export Ops Dashboard for Splunk Dataflow]

More metrics

In your Dataflow Console, you may have noticed several new custom metrics for jobs launched as of template version 2022-03-21-00_RC01, that is, gs://dataflow-templates/2022-03-21-00_RC01/Cloud_PubSub_to_Splunk or later:

[Image: New custom metrics in the Dataflow Console job info panel]

Pipeline instrumentation

Before we dive into the new metrics, let’s take a step back and go over the Splunk Dataflow job steps. The following flowchart represents the different stages that comprise a Splunk Dataflow job, along with their corresponding custom metrics:

[Image: Splunk Dataflow pipeline stages with corresponding custom metrics]

In this pipeline, we utilize two types of Apache Beam custom metrics (illustrated in the sketch below):

- Counter metrics, labeled 1 through 10 in the flowchart, used to count messages and requests (both successful and failed).
- Distribution metrics, labeled A through C in the flowchart, used to report on the distribution of request latency (both successful and failed) and batch size.
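To make the counter/distribution distinction concrete, here is a minimal sketch of how such metrics are declared and updated in an Apache Beam Java DoFn. The class and helper method below are hypothetical and are not the template’s actual code; only the metric names come from the template.

```java
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Distribution;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

// Illustrative sketch: how a Beam DoFn reports counter and distribution
// metrics. Only the metric names match those reported by the template.
public class WriteToSplunkFn extends DoFn<String, Void> {

  // Counter metric: incremented once per successfully exported event.
  private final Counter successfulEvents =
      Metrics.counter(WriteToSplunkFn.class, "outbound-successful-events");

  // Distribution metric: Dataflow reports it as _MAX, _MIN, _MEAN and _COUNT.
  private final Distribution batchSize =
      Metrics.distribution(WriteToSplunkFn.class, "write_to_splunk_batch");

  @ProcessElement
  public void processElement(ProcessContext context) {
    int eventsInBatch = sendBatchToSplunk(context.element());
    batchSize.update(eventsInBatch);
    successfulEvents.inc(eventsInBatch);
  }

  // Hypothetical helper standing in for the actual HEC call.
  private int sendBatchToSplunk(String batch) {
    return 1;
  }
}
```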
Downstream request visibility

Splunk Dataflow operators have relied on some of these pre-built custom metrics to monitor the progress of log messages through the different pipeline stages, particularly in the last stage, Write To Splunk, with the metrics outbound-successful-events (counter #6 above) and outbound-failed-events (counter #7 above) tracking the number of messages that were successfully exported (or not) to Splunk. While operators had visibility into the outbound message success rate, they lacked visibility at the HEC request level. With the addition of counters #8-10 above, Splunk Dataflow operators can now monitor not only the number of successful and failed HEC requests over time, but also the response status breakdown, to determine whether a request failed due to a client request issue (e.g. invalid Splunk index or HEC token) or a transient network or Splunk issue (e.g. server busy or down), all from the Dataflow Console. These counters are:

- http-valid-requests
- http-invalid-requests
- http-server-error-requests

Splunk Dataflow operators can also now track the average latency of downstream requests to Splunk HEC, as well as the average request batch size, by using the new distribution metrics #A-C:

- successful_write_to_splunk_latency_ms
- unsuccessful_write_to_splunk_latency_ms
- write_to_splunk_batch

Note that a Distribution metric in Beam is reported by Dataflow as four sub-metrics suffixed with _MAX, _MIN, _MEAN and _COUNT. That is why these three new distribution metrics translate to twelve new metrics in Cloud Monitoring, as you can see in the job info panel in the Dataflow Console. Dataflow currently does not support creating a histogram to visualize the breakdown of these metrics’ values, so _MEAN is the only useful sub-metric for our purposes. As an all-time average value, _MEAN cannot be used to track changes over arbitrary time intervals (e.g. hourly), but it is useful for capturing a baseline, tracking a trend, or comparing different pipelines.

Dataflow custom metrics, including the aforementioned metrics reported by the Splunk Dataflow template, are a chargeable feature of Cloud Monitoring. For more information on metrics pricing, see Pricing for Cloud Monitoring.

Improved logging

Logging HEC errors

To further root-cause downstream issues, HEC request errors are now adequately logged, including both the response status code and message. You can retrieve them directly in Worker Logs from the Dataflow Console by setting log severity to Error. Alternatively, for those who prefer using Logs Explorer, you can use the following query:

```
log_id("dataflow.googleapis.com/worker")
resource.type="dataflow_step"
resource.labels.step_id="WriteToSplunk/Write Splunk events"
severity=ERROR
```

Disabling batch logs

By default, Splunk Dataflow workers log every HEC request. Even though these requests are often batched events, these ‘batch logs’ are chatty: they add two log messages for every HEC request. With the addition of the request-level counters (http-*-requests), the latency and batch size distributions, and the HEC error logging mentioned above, these batch logs are generally redundant. To control worker log volume, you can now disable them by setting the new optional template parameter enableBatchLogs to false when deploying the Splunk Dataflow job, as shown in the example below. For more details on the latest template parameters, refer to the template user documentation.
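For example, here is a sketch of launching the template with batch logs disabled using the gcloud command-line tool. The project, subscription, HEC URL, token, and dead-letter topic values are placeholders to replace with your own; verify the full parameter list against the template user documentation.

```sh
# Launch the Pub/Sub to Splunk Dataflow template with batch logs disabled.
# All PROJECT_ID/SUBSCRIPTION/token/topic values below are placeholders.
gcloud dataflow jobs run pubsub-to-splunk-export \
  --gcs-location=gs://dataflow-templates/2022-03-21-00_RC01/Cloud_PubSub_to_Splunk \
  --region=us-central1 \
  --parameters=inputSubscription=projects/PROJECT_ID/subscriptions/SUBSCRIPTION_NAME,url=https://your-splunk-hec-host:8088,token=YOUR_HEC_TOKEN,outputDeadletterTopic=projects/PROJECT_ID/topics/DEADLETTER_TOPIC,enableBatchLogs=false
```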
Enabling debug level logs

The default logging level for Google-provided templates written using the Apache Beam Java SDK is INFO, which means all messages of INFO and higher, i.e. WARN and ERROR, will be logged. If you’d like to enable lower log levels like DEBUG, you can do so by setting the --defaultWorkerLogLevel flag to DEBUG when starting the pipeline with the gcloud command-line tool. You can also override log levels for specific packages or classes with the --workerLogLevelOverrides flag. For example, the HttpEventPublisher class logs the final payload sent to Splunk at the DEBUG level. You can set the --workerLogLevelOverrides flag to {"com.google.cloud.teleport.splunk.HttpEventPublisher":"DEBUG"} to view the final message in the logs before it is sent to Splunk, while keeping the log level at INFO for other classes. Exercise caution when using this: it logs every message sent to Splunk under the Worker Logs tab in the console, which might lead to log throttling or reveal sensitive information.

Putting it all together

We put all of this together in a single Monitoring dashboard that you can readily use to monitor your log export operations:

[Image: Pipeline Throughput, Latency & Errors]

This dashboard is a single pane of glass for monitoring your Pub/Sub to Splunk Dataflow pipeline. Use it to ensure your log export is meeting your dynamic log volume requirements by scaling to an adequate throughput (EPS) rate, while keeping latency and backlog to a minimum. There’s also a panel to track pipeline resource usage and utilization, to help you validate that the pipeline is running cost-efficiently during steady state.

[Image: Pipeline Utilization and Worker Logs]

For specific guidance on handling and replaying failed messages, refer to Troubleshoot failed messages in the Splunk Dataflow reference guide. For general information on troubleshooting any Dataflow pipeline, check out the Troubleshooting and debugging documentation, and for a list of common errors and their resolutions, look through the Common error guidance documentation. If you encounter any issues, please open an issue in the Dataflow templates GitHub repository, or open a support case directly in your Google Cloud Console. For a step-by-step guide on how to export Google Cloud logs to Splunk, check out the Deploy production-ready log exports to Splunk using Dataflow tutorial, or use the accompanying Terraform scripts to automate the setup of your log export infrastructure along with the associated operational dashboard.
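If you want to explore these custom metrics beyond the prebuilt dashboard, you can chart them yourself in Cloud Monitoring. The filter below is a sketch that assumes Dataflow surfaces template counters under the dataflow.googleapis.com/job/user_counter metric type with a metric_name label; confirm the exact metric type and labels in Metrics Explorer for your project before relying on it.

```
resource.type="dataflow_job"
metric.type="dataflow.googleapis.com/job/user_counter"
metric.label.metric_name="outbound-successful-events"
```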
Source: Google Cloud Platform

Google’s open-source solution for DFDL processing

The cloud has become the choice for extending and modernizing applications, but some situations make the transition less than straightforward, such as migrating applications that access data from a mainframe environment. Migrating the data and the applications can at certain points be out of sync, so mechanisms need to be in place during the transition to support interoperability with legacy workloads and to access data out of the mainframe. For the latter, the Data Format Description Language (DFDL), an open standard modeling language from the Open Grid Forum (OGF), has been used to access data from a mainframe, e.g. with IBM Integration Bus. DFDL uses a model, or schema, that allows text or binary data to be parsed from its native format and presented as an information set outside the mainframe (i.e., a logical representation of the data contents, independent of the physical format).

DFDL processing with IBM App Connect

Among solutions for parsing and processing data described by DFDL, one option in the past has been IBM App Connect, which allows the development of custom solutions via IBM DFDL. The following diagram represents a high-level architecture of a DFDL solution implemented on IBM App Connect:

[Image: High-level architecture of DFDL processing with IBM App Connect]

IBM App Connect brings stable integration to the table at an enterprise-level cost. According to IBM’s sticker pricing as of May 2022, IBM App Connect charges $500 and above per month for using App Connect with IBM Cloud services. These prices exclude the cost of storing and maintaining DFDL definitions on the mainframe. With the introduction of Tailored Fit Pricing on IBM z15, the cost of maintaining the mainframe can range from $4,900 to $9,300 per month over a span of 5 years, which may be costly for a small or medium business that only wants to process data defined by DFDL.

Introducing the Google open-source DFDL processor on Google Cloud

At Google our mission is to build for everyone, everywhere. With this commitment in mind, the Google Cloud team has developed and open-sourced a DFDL processor that organizations can easily access and customize. We understand that mainframes can be expensive to maintain and use, which is why we have integrated Cloud Firestore and Bigtable as the databases to store the DFDL definitions. Firestore can provide 100K reads, 25K writes, 100K deletes, and 1 TB of storage per month for approximately $186 per month, while Bigtable provides a fast, scalable database solution for storing terabytes, or even petabytes, of data at a relatively low cost. This move away from the mainframe toward cloud-native database solutions can save organizations thousands of dollars every month.

Next, we have substituted App Connect with a combination of our open-source DFDL processor, the Cloud Pub/Sub service, and the open-source Apache Daffodil library. Pub/Sub provides the connection between the mainframe and the processor, and from the processor to the downstream applications. The Daffodil library helps in compiling schemas and outputting infosets for a given DFDL definition and message. The total cost of employing the Pub/Sub service and the Daffodil library comes out to approximately $117 per month, which means an organization can save a minimum of $380 per month by using this solution.

[Table: Summary of the cost difference breakdown between the solutions discussed above]

How it works

The data described by the DFDL usually needs to be available in widely used formats such as JSON in order to be consumed by downstream applications that may have already been migrated to a cloud-native environment. To achieve this, cloud-native applications or services can be implemented in conjunction with Google Cloud services that accept the textual or binary data from the mainframe as input, fetch the corresponding DFDL definition from a database, and finally compile and output the equivalent JSON for downstream applications to consume. The following diagram describes the high-level architecture:

[Image: High-level architecture of the DFDL processor on Google Cloud]

An application, e.g. a DFDL Processor Service, can be built to process the information received from the mainframe, leveraging the Daffodil API to parse the data against the corresponding DFDL schema and output JSON. DFDL schema definitions can be migrated and stored in Firestore or Bigtable; since these definitions rarely change and can be stored in a key-value format, the storage of preference is a non-relational managed database. Google Cloud Pub/Sub provides an eventing mechanism that receives the binary or textual message from a data source, i.e. the mainframe, in a Pub/Sub topic. This allows the DFDL processor to access the data, retrieve the corresponding DFDL definition from Firestore or Bigtable, and pass both on to the Daffodil API to compile and output the JSON result. The JSON result is finally published to a resulting Pub/Sub topic for any downstream application to consume. It is recommended to follow the CloudEvents schema specification, which describes events in common formats, providing interoperability across services, platforms, and systems. A sketch of this flow is shown below.
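The following minimal sketch illustrates the core of such a processor in Java, assuming Apache Daffodil’s Java API (org.apache.daffodil.japi) and the Firestore client library. The class, collection, and field names (DfdlProcessor, dfdl_definitions, schema) are hypothetical, and exact Daffodil constructor signatures may vary by release; treat this as an outline of the flow described above, not the open-sourced implementation itself.

```java
import com.google.cloud.firestore.Firestore;
import com.google.cloud.firestore.FirestoreOptions;
import org.apache.daffodil.japi.Compiler;
import org.apache.daffodil.japi.Daffodil;
import org.apache.daffodil.japi.DataProcessor;
import org.apache.daffodil.japi.ParseResult;
import org.apache.daffodil.japi.ProcessorFactory;
import org.apache.daffodil.japi.infoset.JsonInfosetOutputter;
import org.apache.daffodil.japi.io.InputSourceDataInputStream;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileWriter;

public class DfdlProcessor {

  // Fetch the DFDL schema for a message type from Firestore (hypothetical
  // collection/field names), compile it, and parse the payload to JSON.
  public static String parseToJson(String messageType, byte[] payload) throws Exception {
    Firestore db = FirestoreOptions.getDefaultInstance().getService();
    String schemaText = db.collection("dfdl_definitions")
        .document(messageType)
        .get().get()  // ApiFuture.get() blocks until the read completes
        .getString("schema");

    // Daffodil compiles schemas from files/URIs, so write the definition to
    // a temp file. A production service would cache compiled processors.
    File schemaFile = File.createTempFile("schema", ".dfdl.xsd");
    try (FileWriter w = new FileWriter(schemaFile)) {
      w.write(schemaText);
    }

    Compiler compiler = Daffodil.compiler();
    ProcessorFactory pf = compiler.compileFile(schemaFile);
    if (pf.isError()) {
      throw new IllegalStateException("Schema compilation failed: " + pf.getDiagnostics());
    }
    DataProcessor dp = pf.onPath("/");

    // Parse the binary/textual payload into a JSON infoset.
    ByteArrayOutputStream json = new ByteArrayOutputStream();
    ParseResult result = dp.parse(
        new InputSourceDataInputStream(new ByteArrayInputStream(payload)),
        new JsonInfosetOutputter(json, /* pretty= */ true));
    if (result.isError()) {
      throw new IllegalStateException("Parse failed: " + result.getDiagnostics());
    }
    return json.toString("UTF-8");
  }
}
```

In the full flow, this method would be invoked from a Pub/Sub MessageReceiver callback and the returned JSON published to the output topic, ideally wrapped in a CloudEvents envelope as recommended above.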
You can find examples of the implementation on GitHub:

- Firestore example
- Bigtable example

Conclusion

In this post, we discussed different pipelines used to process data defined by DFDL, along with cost comparisons of these pipelines. Additionally, we demonstrated how to use Cloud Pub/Sub, Firestore, and Bigtable to create a service that listens for binary event messages, extracts the corresponding DFDL definition from a managed database, and processes it to output JSON that downstream applications can consume, using well-established technologies and libraries.

1. Price comparison analysis as of May 2022 and subject to change based on usage.
Source: Google Cloud Platform