Cloud Monitoring further embraces open source by adding PromQL

As Kubernetes monitoring continues to standardize on Prometheus as a form factor, more and more developers are becoming familiar with Prometheus’ built-in query language, PromQL. Besides being bundled with Prometheus, PromQL is popular for being a simple yet expressive language for querying time series data. It’s been fully adopted by the community, with lots of great query repositories, sample playbooks, and trainings for PromQL available online. It’s the query language that Kubernetes developers know and love.Introducing PromQL in the Google Cloud Monitoring user interfaceCloud Monitoring is committed to open source interfaces such as Prometheus, OpenCensus, and OpenTelemetry. We believe that having common standards across the industry improves ease of use for everybody and helps customers avoid lock-in due to provider-specific conventions. A few months ago, we doubled down on our commitment to open source interfaces by releasing PromQL for over 1,500 free metrics in Cloud Monitoring, usable through self-hosted Grafana. Today, we are excited to announce that you can now use PromQL throughout the Cloud Monitoring user interface, including in Metrics Explorer and Dashboard Builder.While using Grafana for Cloud Monitoring metrics will continue to be supported, we recognize that many customers prefer to use a Google-hosted, SLO-backed visualization and dashboarding tool instead of running their own. Now, developers don’t have to learn a new query language or paradigm to use Cloud Monitoring’s UI — they can keep using the same language they already know and love, and newly joined team members who already know PromQL can quickly become fluent with Cloud Monitoring.Cloud Monitoring’s PromQL comes with autocompletion of metric names, label keys, and label values. You can use PromQL to query free Google Cloud system metrics, Kubernetes metrics, log-based metrics, custom metrics, and Prometheus metrics, and you can use PromQL even if you don’t use Managed Service for Prometheus. PromQL in Cloud Monitoring is enabled by default, meaning that using PromQL or Managed Service for Prometheus no longer requires you to configure, run, or scale self-hosted Grafana. You can use both the Cloud Monitoring UI and Grafana, depending on which works best for any given use case.How to get startedThis product is in preview and open to all Google Cloud users. You can query Cloud Monitoring metrics with PromQL by using the PromQL tab in Metrics Explorer or the Dashboard Builder.  PromQL-backed queries can be saved on your custom dashboards, and any dashboard chart can be opened in Metrics Explorer to perform ad hoc analysis using PromQL. To learn how to write PromQL for Google Cloud metrics, see PromQL for Cloud Monitoring metrics.To query Prometheus data alongside Cloud Monitoring metrics, you have to first get Prometheus data into the system. For instructions on configuring Managed Service for Prometheus ingestion, see Get started with managed collection.Related ArticleCloud Monitoring metrics, now in Managed Service for PrometheusQuery over 1,000 free Google Cloud metrics using PromQL. You can now view your Cloud Monitoring metrics alongside your Prometheus metrics.Read Article
Quelle: Google Cloud Platform

Advancing anomaly detection with AIOps—introducing AiDice

This blog post has been co-authored by Jeffrey He, Product Manager, AIOps Platform and Experiences Team.

In Microsoft Azure, we invest tremendous efforts in ensuring our services are reliable by predicting and mitigating failures as quickly as we can. In large-scale cloud systems, however, we may still experience unexpected issues simply due to the massive scale of the system. Given this, using AIOps to continuously monitor health metrics is fundamental to running a cloud system successfully, as we have shared in our earlier posts. First, we shared more about this in Advancing Azure service quality with artificial intelligence: AIOps. We also shared an example deep dive of how we use AIOps to help Azure in the safe deployment space in Advancing safe deployment with AIOps. Today, we share another example, this time about how AI is used in the field of anomaly detection. Specifically, we introduce AiDice, a novel anomaly detection algorithm developed jointly by Microsoft Research and Microsoft Azure that identifies anomalies in large-scale, multi-dimensional time series data. AiDice not only captures incidents quickly, it also provides engineers with important context that helps them diagnose issues more effectively, providing the best experience possible for end customers.

Why are AIOps needed for anomaly detection?

We need AIOps for anomaly detection because the data volume is simply too large to analyze without AI. In large-scale cloud environments, we monitor an innumerable number of cloud components, and each component logs countless rows of data. In addition, each row of data for any given cloud component might contain dozens of columns such as the timestamp, the hardware type of the virtual machine, the generation number, the OS version, the datacenter where the nodes hosting the virtual machine stay in, or the country. The structure of the data we have is essentially multi-dimensional time series data, which contains an exponential number of individual time series due to the various combinations of dimensions. This means that iterating through and monitoring every single time series is simply not practical—applying AIOps is necessary.

How did we approach this, before AiDice?

Before AiDice, the way we handled anomaly detection in large-scale, high-dimensional time series data was to conduct anomaly detection on a selected set of dimensions that were the most important. By focusing on a scoped subset, we would be able to detect anomalies within those combinations quickly. Once these anomalies were detected, engineers would then dive deeper into the issues, using pivot tables to drill down into the other dimensions not included to better diagnose the issue. Although this approach worked, we saw two key opportunities to improve the process. First, the old approach required a lot of manual effort by engineers to determine the exact pivot of anomalies. Second, the approach also limited the scope of direct monitoring by only allowing us to input a limited number of dimensions into our anomaly detection algorithms. Given these reasons, Microsoft Research and Azure worked together to develop AiDice, which improves both of these areas.

How do we approach this now with AiDice, and how does it work?

Now with AiDice, we can automatically localize pivots on time series data even if looking at dozens of dimensions at the same time. This allows us to add a lot more attributes, whether that be the hardware generation or hardware microcode, the OS version, or the networking agent version. Though this makes the search space much larger, AiDice encodes the problem as a combinatorial optimization problem, allowing it to search through the space more efficiently than traditional approaches. Brief details of AiDice are described below, but to see a full explanation of the algorithm, please see the paper published at the ESEC/FSE '20: 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2020).

Part 1: AiDice algorithm—formulation as a search problem

The AiDice algorithm works by first turning the data into a search problem. Search nodes are formed by starting at a given pivot and building the relationships out to the neighbors. For example, if we take a node, "Country=USA, Datacenter=DC1, DiskType=SSD", we can form out the neighboring nodes by swapping, adding, or removing a dimension-value pair, as shown in the diagram below.

Part 2: AiDice algorithm—objective function

Next, the AiDice algorithm searches through the search space in a smart manner by maximizing an objective function that emphasizes two key components. First, the bigger the sudden burst or change in errors, the higher AiDice scores the objective function. Second, the higher the proportion of the errors that occur in this pivot in relation to the total number of errors, the higher AiDice scores the objective function. For example, if there are 5,000 total errors that occurred, it is more important to alert the user about the pivot that went from 3000 errors to 4000 errors than the pivot that went from 10 to 20 errors.

Part 3: Customization of alerts to reduce noise

Next, the alerts that AiDice produces need to be filtered and customized to be less noisy and more actionable since the results so far are optimized from a mathematical perspective but have not yet incorporated domain knowledge around the meaning of the input data. This step can vary widely depending on the nature of the input data, but an example could be that consecutive alerts that share the same error code may be grouped together to reduce the number of total alerts.

AiDice in action—an example

The following is a real example in which AiDice helped detect a real issue early on. The details are altered for confidentiality reasons.

We applied AiDice to monitor low memory error events in a certain type of virtual machine with more than a dozen dimensions of attribute information alongside the fault count, including the region, the datacenter location, the cluster, the build, the RAM, or the event type.
AiDice identified an increase in the number of low memory events on distinct nodes in a particular pivot, which indicated a memory leak.

Build=11.11111, Ram=00.0, ProviderName=Xxxxx-x-Xxxxxx, EventType=8888 (details have been altered for privacy).

When looking at the aggregate trend, this issue is hidden and without AiDice it would take manual effort to detect the exact location of the issue (see graphs below, data normalized for privacy).
The engineer responsible for the ticket looked at the alert and some example cases shown in the alerts to quickly able figure out what was going on.

In this real-world example, AiDice was able to detect an issue in a dimension combination that was causing a particular error type in an automatic fashion, quickly and efficiently. Soon after, the memory leak was discovered and Azure engineers were able to mitigate the issue.

Looking forward

Looking ahead, we hope to improve AiDice to make Azure even more resilient and reliable. Specifically, we plan to:

Support additional scenarios in Azure: AiDice is being applied to many scenarios in Azure already, but the algorithm has room to improve with respect to the types of metrics it can operate on. Microsoft Azure and the Microsoft Research team are working together to support more metric scenarios.
Prepare additional data feeds in Azure for AiDice: In addition to upgrading the AiDice algorithm itself to support more scenarios, we are also working to add supporting attributes to certain data sources to fully leverage the power of AiDice.

Learn more

Sign up for Microsoft Azure today.
Visit the Advancing Reliability Series.

Quelle: Azure

Bottlerocket wird jetzt von Amazon Inspector unterstützt

Bottlerocket, ein Linux-basiertes Betriebssystem, wurde speziell für die Ausführung von Containern entwickelt, und ist jetzt in Amazon Inspector integriert. Kunden, bei denen das Inspector-C2-Scanning bereits aktiviert ist, müssen keine weiteren Maßnahmen ergreifen. Wenn Amazon Inspector eine Schwachstelle entdeckt, empfiehlt es ein Update auf die Version von Bottlerocket, die diese Schwachstelle behebt.
Quelle: aws.amazon.com

AWS kündigt Amazon WorkSpaces Core an

Amazon WorkSpaces Core ist ein neuer, vollständig verwalteter Virtual Desktop Infrastructure (VDI)-Service, der die Sicherheit, globale Zuverlässigkeit und Kosteneffizienz von AWS mit bestehenden VDI-Verwaltungslösungen kombiniert. Mit WorkSpaces Core entfällt die Kapazitätsplanung oder die Aktualisierung der Infrastruktur, und die Bereitstellung, der Einsatz und die Verwaltung von VDI-Umgebungen werden vereinfacht. Mit WorkSpaces Core kannst du deine lokalen VDI-Lösungen zu AWS migrieren oder in einer hybriden Umgebung bereitstellen, die On-Premises-Ressourcen und die WorkSpaces-Core-Infrastruktur umfasst. Verwalte deine gesamte Benutzerbasis über eine einzige Konsole, ohne die Endbenutzer zu beeinträchtigen oder deine IT-Mitarbeiter erneut zu schulen.
Quelle: aws.amazon.com