Last month today: GCP in June ‘19

As we bid farewell to June, we also say hello to a new partner and new product integrations, all with the goal of making Google Cloud Platform (GCP) ever more useful for your particular needs. Data analytics, in particular, continued to make leaps and bounds this month with new features and integrations. Here’s a look at last month’s top stories.

Data analytics, better together

In June, we announced our intent to acquire Looker, a unified platform for business intelligence, data applications, and embedded analytics. Looker extends our business analytics offering with two important capabilities: first, the ability to define business metrics once in a consistent way across data sources; second, a powerful analytics platform that delivers applications for business intelligence and use-case-specific solutions such as sales analytics, as well as a flexible, embedded analytics product to collaborate on business decisions. We look forward to sharing more once the deal closes.

We also announced a partnership with data warehouse provider Snowflake, which will help users store and analyze data from a wide variety of sources. You’ll be able to use Snowflake along with our analytics and ML products, so you can store data in GCP, then analyze that data using Snowflake, with strong performance and reliability.

Using BigQuery for blockchain, and integrated with Kaggle

This post about building hybrid cloud/blockchain applications with Ethereum and GCP explains how BigQuery data can be used from within blockchain applications such as prediction marketplaces and transaction privacy. It shows how you can use the smart contract platform Ethereum together with BigQuery through Chainlink middleware. This brings bidirectional operation between blockchain data and cloud services, adding efficiency and letting developers create new hybrid applications.

Also new in BigQuery this month: Kaggle is now integrated with BigQuery, so you can perform SQL queries, train ML models, and then analyze that data in the Kaggle Kernels environment. With Kaggle Kernels and BigQuery, you can link your Google Cloud account, then compose queries directly in the notebook. You can also explore public datasets in BigQuery, and build and evaluate ML regression models without needing much prior experience. Give it a try using the BigQuery sandbox.

Why cloud-native, and why it matters

The concept of cloud-native architecture has become popular as cloud computing has matured, and this post describes why cloud-native is fundamentally different from on-premises architecture. You’ll get a look at the principles of designing for the cloud, with some tips on improvements you can take advantage of, like more automation, managed services, and defense in depth.

School’s out, but certification is in

We announced a new Google Cloud certification challenge in June. Get certified within 12 weeks, and you’ll get a $100 Google Store voucher. It’s not too late to sign up; if you join the challenge, you get free access to Coursera and Qwiklabs resources, which you can use to study for either the Google Cloud Certified Associate Cloud Engineer or the Professional Cloud Architect exam. There are also a few new Qwiklabs quests to help you learn more about Kubernetes, specifically security and monitoring. The first of these self-paced labs covers migration and observability for containers, whether you’re running Kubernetes, Google Kubernetes Engine (GKE), or Anthos. The second focuses on securing Kubernetes apps, with labs on role-based access control, binary authorization, and more.

That’s a wrap for June. We’ll see you next month.
Source: Google Cloud Platform

Tips and tricks to get your Cloud Dataflow pipelines into production

As data processing pipelines become foundational for enterprise applications, it’s mission-critical to make sure that your production data pipelines are up and running, and that any updates cause minimal disruption to your user base. When your data pipelines are stable, internal users and customers can trust them, which allows your business to operate with confidence. That extra confidence often leads to better user satisfaction and requests for more features and enhancements for the things your pipelines process. We’ll describe here some of the best practices to take into consideration as you deploy, maintain, and update your production Cloud Dataflow pipelines, which of course make use of the Apache Beam SDK. We’ll start with some tips on building pipelines in Cloud Dataflow, then get into how to maintain and manage your pipelines.

Building easy-to-maintain data pipelines

The usual rules of good software development apply to your pipelines. Using composable transforms that encapsulate business logic creates pipelines that are easier to maintain, troubleshoot, and develop with. Your SREs won’t be happy with the development team if there is an issue at 2am and they are trying to unpick what’s going on in that 50-stage pipeline built entirely with anonymous DoFns. Luckily, Apache Beam provides the fundamental building blocks to build easy-to-understand pipelines.

Using composable PTransforms to build pipelines

PTransforms are a great way to encapsulate your business logic and make your pipelines easier to maintain, monitor, and understand. (The PTransform style guide is a great resource.) Below are some extra hints and tips to follow:

Reuse transforms. Regardless of the size of your organization, you will often carry out many of the same transforms on a given dataset across many pipelines. Identify these transforms and create a library of them, treating them the same way you would a new API:
- Ensure they have a clear owner (ideally the data engineering team closest to that dataset).
- Ensure you have a clear way to understand and document how changes to those transforms will affect all pipelines that use them.
- Ensure these core transforms have well-documented characteristics across various lifecycle events, such as update/cancel/drain. By understanding the characteristics of the individual transforms, it will be easier to understand how your pipelines will behave as a whole during these lifecycle events.

Use display names. Always use the name field to assign a useful, at-a-glance name to a transform. This field value is reflected in the Cloud Dataflow monitoring UI and can be incredibly useful to anyone looking at the pipeline. It is often possible to identify performance issues without having to look at the code, using only the monitoring UI and well-named transforms. The diagram below shows part of a pipeline with reasonable display names in the Cloud Dataflow UI; you can see it’s pretty easy to follow the pipeline flow, making troubleshooting easier.

Use the dead letter pattern. Ensure you have a consistent policy for dealing with errors and issues across all of your transforms. Make heavy use of the dead letter pattern whenever possible. This pattern involves branching errors away from the main pipeline path into either a dead letter queue for manual intervention or a programmatic correction path.

Remember versioning. Ensure that all transforms and all pipelines have versions. If possible, keep this code in your source code repositories for a period that matches the retention policies for the data that flowed through these pipelines.

Treat schema as code. Data schemas are heavily intertwined with any pipeline. Though it’s not always possible, try to keep your definitions of schemas as code in your source repository, distinct from your pipeline code.
Ideally, the schema will be owned and maintained by the data owners, with your build systems triggering global searches when things break.

Use idempotency. When possible, ensure that most, if not all, mutations external to the pipeline are idempotent. For example, an update of a data row in a database within a DoFn is idempotent. Cloud Dataflow provides exactly-once processing within the pipeline, with transparent retries within workers. But if you are creating non-idempotent side effects outside of the pipeline, these calls can occur more than once. It’s important to clearly document transforms and pipelines that have this type of effect. It’s also important to understand how this may vary across different lifecycle choices like update, cancel, and drain.

Testing your Cloud Dataflow pipelines

There’s plenty of detail on the Apache Beam site to help with the general testing of your transforms and pipelines.

Lifecycle testing

Your end-to-end data pipeline testing should include lifecycle testing, particularly analyzing and testing all the different update/drain/cancel options. That’s true even if you intend to only use update in your streaming pipelines. This lifecycle testing will help you understand the interactions the pipeline will have with all data sinks and any side effects you can’t avoid. For example, you can see how it affects your data sink when partial windows are computed and added during a drain operation.

You should also ensure you understand the interactions between the Cloud Dataflow lifecycle events and the lifecycle events of sources and sinks, like Cloud Pub/Sub. For example, you might check what interaction, if any, features like replayable or seekable sources have with streaming unbounded sources. This also helps you understand the interaction with your sinks and sources if they have to go through a failover or recovery situation. Of special importance here is the interaction of such events with watermarks. Sending data with historic timestamps into a streaming pipeline will often result in those elements being treated as late data. This may be semantically correct in the context of the pipeline, but not what you intend in a recovery situation.

Testing by cloning streaming production environments

One of the nice things about streaming data sources like Cloud Pub/Sub is that you can easily attach extra subscriptions to a topic. This comes at an extra cost, but for any major updates, you should consider cloning the production environment and running through the various lifecycle events. To clone the Cloud Pub/Sub stream, you can simply create a new subscription against the production topic. You may also consider doing this activity on a regular cadence, such as after you have had a certain number of minor updates to your pipelines. The other option this brings is the ability to carry out A/B testing. This can depend on the pipeline and the update, but if the data you’re streaming can be split (for example, on entry to the topic) and the sinks can tolerate different versions of the transforms, then this gives you a great way to ensure everything goes smoothly in production.

Setting up monitoring and alerts

Apache Beam allows a user to define custom metrics, and with the Cloud Dataflow runner, these custom metrics can be integrated with Google Stackdriver. We recommend that in addition to the standard metrics, you build custom metrics into your pipeline that reflect your organization’s service-level objectives (SLOs). You can then apply alerts to these metrics at various thresholds to take remediation action before your SLOs are violated. For example, in a pipeline that processes application clickstreams, seeing lots of errors of a specific type could indicate a failed application deployment. Set these up as metrics and create alerts in Stackdriver for them. Following these good practices will help when you do have to debug a problematic pipeline.
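To make the dead letter pattern and custom error metrics concrete, here is a minimal, SDK-independent sketch in plain Python. The record shape, function names, and the `metrics` counter are all illustrative assumptions, not part of any Beam API; in a real Beam pipeline you would branch failures with tagged outputs and report counts through the Beam metrics API instead.

```python
import json
from collections import Counter

# Illustrative error counter; a Beam pipeline would use its metrics API.
metrics = Counter()

def parse_clickstream(raw):
    """Parse one raw record; raise ValueError on bad input."""
    event = json.loads(raw)
    if "user_id" not in event:
        raise ValueError("missing user_id")
    return event

def run_with_dead_letter(records):
    """Route good records to the main output, failures to a dead letter list."""
    main, dead_letter = [], []
    for raw in records:
        try:
            main.append(parse_clickstream(raw))
        except ValueError as err:  # json.JSONDecodeError is a ValueError
            metrics["parse_errors"] += 1  # feed an alerting threshold later
            dead_letter.append({"raw": raw, "error": str(err)})
    return main, dead_letter

good, bad = run_with_dead_letter(
    ['{"user_id": 1, "page": "/home"}', "not json", '{"page": "/x"}']
)
```

Because failures carry the original payload plus the error, the dead letter output can be replayed after a fix or inspected manually, and the error counter gives Stackdriver something to alert on.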
Check out the hints and tips outlined in the troubleshooting your pipeline Cloud Dataflow documentation for help with that process.

Pipeline deployment life cycles

Here’s what to consider when you’re establishing pipelines in Cloud Dataflow.

Initial deployment

When deploying a pipeline, you can give it a unique name. This name is used by the monitoring tools on the Cloud Dataflow page (and anywhere else names are visible, such as in gcloud commands). It also has to be unique within the project. This is a great safety feature, as it prevents two copies of the same pipeline from accidentally being started. You can ensure this by always having the --name parameter set on all pipelines. Get into the habit of doing this with your test and dev pipelines as well.

Tips for improving the streaming pipeline lifecycle

There are some things you can do to build better data pipelines. Here are some tips.

Create backups of your pipelines

Some sources, like Cloud Pub/Sub and Apache Kafka, let you replay a stream from a specific point in processing time. The specific mechanisms vary: for example, Apache Kafka supports reads from a logical offset, while Cloud Pub/Sub uses explicitly taken snapshots or wall-clock times. Your application must take the specifics of the replay mechanism into account to ensure you replay all the data you need and do not introduce unexpected duplication. This feature (when available) lets you create a backup of the stream data for reuse if required. One example would be a rollout of a new pipeline that has a subtle bug not caught in unit testing, end-to-end testing, or at the A/B testing phase. The ability to replay the stream in those ideally rare situations allows the downstream data to be corrected without a painful one-off data correction exercise. You can also use the Cloud Pub/Sub snapshot and seek features for this purpose.

Create multiple replica pipelines

This option is a pro tip from the Google SRE team.
It is similar to the backup pipeline option. However, rather than creating the pipeline to be replayed in case of an issue, you spin up one or more pipelines to process the same data. If the primary pipeline update introduces an unexpected bug, then one of the replicas is used instead to read the production data.

Update a running pipeline

Update is a powerful feature of the Cloud Dataflow runner that allows you to update an existing pipeline in situ, with in-flight data processed by the new pipeline. This option is available in a lot of situations, though as you would expect, there are certain pipeline changes that will prevent the use of this feature. This is why it’s important to understand the behavior of various transforms and sinks during different lifecycle events as part of your standard SRE procedures and protocols.

If it’s not possible to make the update, the compatibility check will let you know during the update process. In that case, you should explore the Cloud Dataflow drain feature. Drain stops pulling data from your sources and finishes processing all data remaining in the pipeline. You may end up with incomplete window aggregations in your downstream sinks. This may not be an issue for your use case, but if it is, you should have a protocol in place to ensure that when the new pipeline is started, the partial data is dealt with in a way that matches your business requirements. Note that a drain operation is not immediate, as Cloud Dataflow needs to process all the source data that was read; in some cases, it might take several hours until the time windows are complete. However, the upside of drain is that all data that was read from sources and acknowledged will be processed by the pipeline. If you want to terminate a pipeline immediately, you can use the cancel operation, which stops a pipeline almost immediately but will result in all of the read and acknowledged in-flight data being dropped.
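The difference between drain and cancel for windowed aggregations can be illustrated with a small, self-contained simulation in plain Python. This is not Beam code, and the class, method, and window names are invented for the sketch; it only models the behavior described above: drain flushes buffered, possibly partial, windows downstream, while cancel drops them.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # fixed one-minute windows

class StreamingCounter:
    """Toy per-window event counter standing in for a streaming pipeline."""

    def __init__(self):
        self.open_windows = defaultdict(int)  # window start -> event count
        self.emitted = {}                     # windows delivered to the sink

    def process(self, timestamp):
        window_start = timestamp - (timestamp % WINDOW_SECONDS)
        self.open_windows[window_start] += 1

    def close_window(self, window_start):
        # Normal completion: the watermark passed the end of the window.
        self.emitted[window_start] = self.open_windows.pop(window_start)

    def drain(self):
        # Drain: stop reading, but flush in-flight (possibly partial) windows.
        for window_start in list(self.open_windows):
            self.close_window(window_start)

    def cancel(self):
        # Cancel: stop immediately; buffered in-flight data is dropped.
        self.open_windows.clear()

pipeline = StreamingCounter()
for ts in (5, 20, 65):       # two events in window [0, 60), one in [60, 120)
    pipeline.process(ts)
pipeline.close_window(0)     # window [0, 60) completes normally
pipeline.drain()             # partial window [60, 120) is flushed anyway
```

Calling cancel() at the same point would instead leave window [60, 120)’s partial count unrecorded, which is why drain is preferred whenever acknowledged data must not be lost.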
Learn more about updating a pipeline, and find out more about CI/CD pipelines with Cloud Dataflow.
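As a closing illustration of the composability advice at the top of this section, here is a plain-Python sketch of a pipeline built from small, named, reusable steps rather than one anonymous 50-stage function. This is not Beam code; the step names, record fields, and threshold are invented for the example.

```python
def normalize_currency(order):
    """Reusable business-logic step: convert cents to a decimal amount."""
    return {**order, "amount": order["amount_cents"] / 100}

def flag_large_orders(order, threshold=500.0):
    """Reusable step: mark orders at or above a review threshold."""
    return {**order, "needs_review": order["amount"] >= threshold}

# A pipeline is an ordered list of named steps, so each step can be owned,
# versioned, and unit tested on its own; the name would become the display
# name in a monitoring UI.
PIPELINE = [("NormalizeCurrency", normalize_currency),
            ("FlagLargeOrders", flag_large_orders)]

def run(records, steps=PIPELINE):
    for name, step in steps:
        records = [step(r) for r in records]
    return records

orders = run([{"amount_cents": 125000}, {"amount_cents": 999}])
```

Because each step is a standalone function with a stable name, an SRE paged at 2am can test the suspect step in isolation instead of unpicking the whole pipeline.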
Source: Google Cloud Platform

Smartphones: Huawei remains cautiously optimistic

At a press event in Berlin, Huawei reaffirmed its promise for the future: anyone who buys a Huawei smartphone should be able to keep using it as usual. We did not get precise answers to questions about the details, however; diplomacy took center stage, even more so than the new 5G smartphone. An analysis by Tobias Költzsch (Huawei, Smartphone)
Source: Golem

Azure FXT Edge Filer now generally available

Scaling and optimizing hybrid network-attached storage (NAS) performance gets a boost today with the general availability of the Microsoft Azure FXT Edge Filer, a caching appliance that integrates on-premises network-attached storage and Azure Blob Storage. The Azure FXT Edge Filer creates a performance tier between compute and file storage, providing high-throughput, low-latency network file system (NFS) access to high-performance computing (HPC) applications running on Linux compute farms, as well as the ability to tier data to Azure Blob Storage.

Fast performance tier for hybrid storage architectures

Today’s general availability of the Azure FXT Edge Filer further integrates into the Azure ecosystem the highly performant and efficient technology that Avere Systems pioneered. The Azure FXT Edge Filer is a purpose-built evolution of the popular Avere FXT Edge Filer, in use globally to optimize storage performance in read-heavy workloads.

The new hardware model goes beyond top-line integration with substantial updates. It is now manufactured by Dell and has been upgraded with twice as much memory and 33 percent more SSD capacity. Two models with varying specifications are available today. With the new 6600 model, customers will see about a 40 percent improvement in read performance over the Avere FXT 5850. The appliance now supports hybrid storage architectures that include Azure Blob storage.

Edge filer hardware is recognized as a proven solution for storage performance improvements. With many clusters deployed around the globe, Azure FXT Edge Filer can scale performance separately from capacity to optimize storage efficiency. Companies large and small use the appliance to accelerate challenging workloads for processes like media rendering, financial simulations, genomic analysis, seismic processing, and wide area network (WAN) optimization. Now with new Microsoft Azure supported appliances, these workloads can run with even better performance and easily leverage Azure Blob storage for active archive storage capacity.

Rendering faster

Visual effects studios have been long-time users of this type of edge appliance, as their rendering workloads frequently push storage infrastructures to their limits. When one of these companies, Digital Domain, heard about the new Azure FXT Edge Filer hardware, they quickly agreed to preview a 3-node cluster.

“I’ve been running my production renders on Avere FXT clusters for years and wanted to see how the new Azure FXT 6600 stacks up. Setup was easy as usual, and I was impressed with the new Dell hardware. After a week of lightweight testing, I decided to aim the entire render farm at the FXT 6600 cluster and it delivered the performance required without a hiccup and room to spare.”

Mike Thompson, Principal Engineer, Digital Domain

Digital Domain has nine locations in the United States, China, and India.

Manage heterogeneous storage resources easily

Azure FXT Edge Filers help keep analysts, artists, and engineers productive, ensuring that applications aren’t affected by storage latency. Storage administrators can easily manage these heterogeneous pools of storage in a single file system namespace, while users access their files through a single mount point, whether the files are stored in on-premises NAS or in Azure Blob storage.

Expanding a cluster to meet growing demands is as easy as adding additional nodes. The Azure FXT Edge Filer scales from three to 24 nodes, allowing even more productivity in peak periods. This scale helps companies avoid overprovisioning expensive storage arrays and enables moving to the cloud at the user’s own pace.

Gain low latency hybrid storage access

Azure FXT Edge Filers deliver high throughput and low latency for hybrid storage infrastructure supporting read-heavy HPC workloads. Azure FXT Edge Filers support storage architectures with NFS and server message block (SMB) protocol support for NetApp and Dell EMC Isilon NAS systems, as well as cloud APIs for Azure Blob storage and Amazon S3.

Customers are using the flexibility of the Azure FXT Edge Filer to move less frequently used data to cloud storage resources, while keeping files accessible with minimal latency. These active archives enable organizations to quickly leverage media assets, research, and other digital information as needed.

Enable powerful caching of data

Software on the Azure FXT Edge Filers identifies the most in-demand or hottest data and caches it closest to compute resources, whether that data is stored down the hall, across town, or across the world. With a cluster connected, the appliances take over, moving data as it warms and cools to optimize access and use of the storage.
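The hot-data caching behavior described above can be sketched with a small least-recently-used cache in Python. This is only a conceptual illustration of keeping hot data close to compute and evicting data as it cools, not the Azure FXT Edge Filer’s actual caching algorithm; the class, file names, and backend are invented for the example.

```python
from collections import OrderedDict

class HotDataCache:
    """Toy LRU cache: recently read files stay close, cold ones are evicted."""

    def __init__(self, capacity, backend_read):
        self.capacity = capacity
        self.backend_read = backend_read  # slow path, e.g. remote NAS or Blob
        self.cache = OrderedDict()        # path -> bytes, coldest first
        self.hits = self.misses = 0

    def read(self, path):
        if path in self.cache:
            self.cache.move_to_end(path)  # mark as hottest
            self.hits += 1
            return self.cache[path]
        self.misses += 1
        data = self.backend_read(path)    # fetch from the backing store
        self.cache[path] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the coldest entry
        return data

# Hypothetical backing store of render frames.
backend = {"a.exr": b"frame-a", "b.exr": b"frame-b", "c.exr": b"frame-c"}
cache = HotDataCache(capacity=2, backend_read=backend.__getitem__)
cache.read("a.exr"); cache.read("b.exr"); cache.read("a.exr")  # "a" is hot
cache.read("c.exr")                                            # evicts "b"
```

The same idea scales down poorly and up well: the closer the cache sits to the compute farm, the more of the repeated reads in a render or simulation workload it can absorb.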

Get started with Azure FXT Edge Filers

Whether you are currently running Avere FXT Edge Filers and looking to upgrade to the latest hardware or expand your clusters, or you are new to the technology, the process to get started is the same. You can request information by completing this online form or by reaching out to your Microsoft representative.

Microsoft will work with you to configure the optimal combination of software and hardware for your workload and facilitate its purchase and installation.

Resources

Azure FXT Edge Filer preview blog

Azure FXT Edge Filer product information

Azure FXT Edge Filer documentation

Azure FXT Edge Filer data sheet
Source: Azure