How AIOps helps Nextel Brazil predict and prevent network outages

Smartphones play a significant role in the lives and productivity of people around the world. Consider these statistics about smartphone usage from TechJury:

Internet users worldwide who visit the web on a mobile device: 67%
Emails read on mobile devices: 49.1%
Smartphone users who say they are addicted to their phones: 66%

Clearly, many people today don’t want to (or can’t) be without their smartphones. And like all telecommunications companies, Nextel Brazil is trying to be as customer-centric as possible. We strive to make customer service part of the company’s DNA and treat customers as our primary asset, because they are.
I have a great team of 75 people working with me, across three shifts covering morning, noon, and night. We do the best we can to satisfy our customers’ needs because we know our subscribers depend on their mobile phones to work and live their lives. Every second there’s a network outage and customers don’t have service, we have to be there for them. That’s what being customer-centric means to us, especially in operations.
Reducing network outages and mean time to repair with IBM Netcool
Mean time to repair (MTTR) is the key performance indicator for us. We started our partnership with IBM when we began using IBM Netcool Operations Insight software to correlate alarms and get to the root cause of problems faster. We have more than 25,000 network elements and multiple management systems being monitored by Netcool. The solution has helped us reduce the MTTR, from receiving an alarm to solving the problem in the field or through a configuration change, from 30 minutes to less than five minutes.
Still, people don’t want to wait even one minute, never mind five, to get their services restored. And as our services and network have increased in complexity, so has the amount of data generated.
After approximately three years of maturing this solution, we started to say, “Hey, we can do better. We can be more proactive in treating problems. Let’s start looking at the data. Let’s start applying some analytics to the data.”
We have to be able to predict and to be prepared for network problems, because we know that they will happen. This is our day to day. We wanted to be better prepared for incidents and be able to make adjustments to avoid a network outage.
Moving from reactive to predictive with AIOps
We began working with IBM Watson technology to implement artificial intelligence for IT operations (AIOps). Watson helps us categorize all the incidents, so we have a better understanding of what is happening in the network, such as whether an outage is due to a utility problem. More than just telling us we have a problem, Watson tells us why we have it. Now we can group incidents together and focus on fixing things at the source.
We’re also working with The Weather Company, an IBM Business, to predict weather-related incidents and prevent them from impacting service. Our network operations team has a high dependence on utility companies because our cell towers run on electric power. We have a problem when they have a problem, and they are very dependent on the weather.
With The Weather Company data, we can correlate and look into historical data and know that every time we cross a certain threshold of rain, wind speed, soil moisture, or some other set of parameters, we will have a problem with cell towers in a given region.
If we know that one of these conditions is going to occur in the next 72 hours, we can be better prepared to act. For example, we might send a small generator or extra batteries to the site to keep it up longer. By knowing the probability and duration of the fault, we can prepare in a way that helps avoid an outage for our customers in that region.
AIOps with Watson and The Weather Company data has helped us complete the journey to being predictive in network operations. It’s a great feeling to know that we don’t have to just wait for something terrible to happen and then react to it. We can actually do something about it before it happens. And this means that our customers who depend on their mobile phones are less likely to be without service.
Watch the related video.
Source: Thoughts on Cloud

Working with Qubole to tackle the challenges of machine learning at an enterprise scale

With virtually unlimited storage and compute resources, the cloud has emerged as the prime location for enterprises doing large-scale big data and machine learning projects. Enterprises need ever more sophisticated technology to innovate quickly with data projects in the cloud, without compromising ease of use, scale, and security. At Google Cloud, we’re building cloud infrastructure that’s flexible and open-source-centric to meet customer needs.

Our partners are essential to our mission of helping customers grow their tech capabilities and their businesses. Qubole, a recently announced Google Cloud Platform (GCP) partner, offers an integrated cloud-native data analytics platform. Qubole provides GCP users with a unified, self-service platform where data scientists and data engineers can collaborate using familiar tools and languages, along with performance-optimized versions of open-source data processing engines, including Apache Spark, TensorFlow, Presto, Airflow, Hadoop, Hive, and more. With Qubole, you can quickly combine and analyze data from BigQuery and data lakes on Cloud Storage.

Building modern machine learning models

We’ve heard great stories from customers using the Qubole platform for powerful analytics, including Recursion Pharmaceuticals, True Fit, and AgilOne, which provides a customer data platform for its enterprise users. AgilOne supports real-time use cases and large volumes of data. To do that, it operates complex machine learning (ML) models and stores vast quantities of data using Qubole and GCP for its 150-plus customers, including lululemon, Travelzoo, and TUMI.

AgilOne Cortex is a machine learning framework that uses supervised machine learning models to predict customer events such as purchase, subscription, and engagement, and segments customers by interest and behavior using unsupervised learning techniques. AgilOne Cortex’s recommender system models let AgilOne’s customers orchestrate offers and messages on a one-to-one basis. AgilOne uses cloud platforms to perform close to one billion predictions every day, averaging tens of millions of customer predictions for each client across all its models.

To meet the challenges of such vast amounts of data and millions of predictions, AgilOne chose Qubole and GCP to better automate the provisioning of machine learning data-processing resources based on workload, while allowing for portability across cloud providers; to eliminate prototyping bottlenecks; to support the seamless orchestration of jobs; and to automate cluster management.

AgilOne now runs a variety of workloads for querying data, running ML models, orchestrating ML workflows, and more on Qubole, all on a single platform with optimized versions of Apache Spark, Apache Airflow, and Zeppelin notebooks, leveraging Qubole’s APIs to automate tasks.
Using GCP and Qubole, AgilOne has seen some key benefits:

Elimination of critical bottlenecks through intelligent, autonomous, and self-service provisioning of compute resources for data science models.
Increased efficiency for AgilOne’s machine learning and ops teams.
Improved prototyping and efficient movement of ML models into production.
Simplified and reduced time to production, transitioning to GCP through a consistent user experience, tools, and technologies.
Efficient orchestration of the machine learning model lifecycle through Airflow.
End-to-end task automation with Qubole APIs.
Improved customer support, with zero-downtime upgrades and rollback capabilities.

AgilOne also uses Google Cloud Storage as the real-time data store for its customers’ transaction and event data. This repository of cleansed, deduped, and enriched data serves as the master customer record for all reporting, analyses, machine learning models, and advanced segmentation.

Limiting bottlenecks, simplifying cluster management

Using Qubole and GCP, AgilOne’s data science team can make cluster management and provisioning more self-service, smarter, and less dependent on operations teams. They’re now able to deliver ML models more quickly, and AgilOne’s data teams rely less on the operations team, since infrastructure is provisioned automatically through Qubole. Qubole on GCP makes it easier to provision new and larger clusters with different sets of permissions, install dependencies on VMs, maintain stable prototyping environments, and upgrade software. The data science team’s variable infrastructure needs are now addressed with intelligent automation, spinning up and releasing clusters and different types of nodes as needed. In Qubole’s managed Zeppelin environment, AgilOne can prototype its Python/PySpark/Scala applications. Comprehensive quality assurance and support, zero-downtime software upgrades, and rollback capabilities add stability to AgilOne’s ML operations. Eliminating bottlenecks has let the company build and test new models at lightning speed, which translates to much faster go-to-market and onboarding of new clients.

Finding improved execution

AgilOne Cortex requires a powerful orchestration system to run and monitor dozens of models for all clients, and to run each model across all users every day. Since Qubole and GCP embrace open source, AgilOne’s data science team can use Airflow, a configuration-as-code workflow engine. This has allowed AgilOne to better manage the lifecycle of its ML workflows through easy maintenance, versioning, and testing. Qubole also provides customers like AgilOne a comprehensive set of APIs critical for end-to-end automation, covering tasks such as starting and stopping clusters, submitting a Spark job or changing the Spark configuration, generating reports, increasing timeouts, and more.

Looking ahead with cloud

As AgilOne’s business continues to rapidly expand, its need for more data insights and more models increases. AgilOne will look to use Qubole and GCP for running ad hoc queries for data discovery, exploration, and analyses. From a cluster-management perspective, AgilOne wants to further use Qubole’s intelligent management of Google’s Preemptible VMs and heterogeneous cluster management capabilities to lower its ML processing costs without compromising reliability. Learn more about AgilOne on Qubole and GCP and about AgilOne’s technology.
Source: Google Cloud Platform

How Full is my Cluster – Part 5: A Capacity Management Dashboard

Introduction
This is the fifth installment in the series regarding capacity management with OpenShift.

In the first post, we covered the basics of how resource management works in OpenShift and showed a way to visualize node and pod resource consumption using kube-ops-view.
In the second post, we illustrated some best practices on how to protect the nodes from being overcommitted.
In the third post, we presented best practices on how to set up a capacity management process and which metrics to follow.
In the fourth post, we introduced the vertical pod autoscaler operator along with its potential use as a way to estimate the correct sizing of pods.

In this post, we will introduce a ready-to-use dashboard on which you can base your capacity management process.
The primary goal of this dashboard is to answer a very specific question: Do I need a new node?
Naturally, this is not the only question for a well-conceived capacity management process, but it is certainly a good starting point. It can serve as a foundation for a more sophisticated dashboard that fully supports your capacity management process.
The Dashboard
This capacity management dashboard works on a group of nodes and helps you decide whether you need more or fewer nodes in that group. You can have many node groups in your cluster, so the user can select the node group they want to work on from a drop-down list.
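As a sketch of how such a selector can be implemented: if the dashboard is built in Grafana on top of the cluster’s Prometheus, the drop-down can be backed by a template variable query such as label_values(kube_node_labels, label_nodegroup). The kube_node_labels series (from kube-state-metrics) and the nodegroup label key are assumptions here; they must match the labeling convention described under Assumptions below.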
The following illustrates the contents of the capacity management dashboard:
The primary metrics presented are for memory and CPU, and the way they are collected is the same for both. The metrics always refer to aggregates within the selected node group. The dashboard displays three metrics:

Quota/Allocatable ratio
Request/Allocatable ratio
Usage/Allocatable ratio

As you may recall from the second post on overcommitment, Allocatable is the amount of resources actually available to pods on a given node. This value is calculated from the total capacity of a node after resources reserved for the operating system and other basic services, such as the container runtime service and the node service, have been accounted for. For example, a node with 16GiB of memory capacity that reserves 1GiB for the operating system and 1GiB for node-level services has 14GiB allocatable.
In OpenShift 4.x, Prometheus alerts are also set up during installation. These alerts trigger when the Quota/Allocatable ratio passes 100% and when the Request/Allocatable ratio passes 80%.
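As a rough sketch, an alert of this kind can be expressed as a PrometheusRule object. The rule below encodes the 80% Request/Allocatable condition for memory; the 100% Quota/Allocatable alert would follow the same pattern. The kube-state-metrics series names and the nodegroup label join are assumptions that must be adapted to your environment (older kube-state-metrics versions expose kube_pod_container_resource_requests_memory_bytes instead):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: capacity-management-alerts
  namespace: openshift-monitoring
spec:
  groups:
  - name: capacity.rules
    rules:
    - alert: NodeGroupRequestsNearAllocatable
      # Fires when the memory requested by pods running in a node
      # group exceeds 80% of the memory allocatable in that group.
      # For brevity, pod phase is ignored; pods that are not yet
      # scheduled carry no node label and drop out of the join.
      expr: |
        sum by (label_nodegroup) (
            kube_pod_container_resource_requests{resource="memory"}
          * on (node) group_left (label_nodegroup) kube_node_labels
        )
        /
        sum by (label_nodegroup) (
            kube_node_status_allocatable{resource="memory"}
          * on (node) group_left (label_nodegroup) kube_node_labels
        ) > 0.8
      for: 15m
      labels:
        severity: warning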
You can find the dashboard and its installation instructions in this repository. In order to work properly, the dashboard requires the cluster nodes and projects to be organized in a certain way; see the end of this post for more information.
In the following sections, we explain how the collected metrics can be interpreted and used to deduce or forecast whether you need more nodes.
Interpretation of the Quota/Allocatable ratio
This metric is calculated with the following formula:

Quota/Allocatable ratio = sum(ClusterResourceQuota hard requests over the node group) / sum(node Allocatable over the node group)
This metric can be interpreted as what has been promised (the granted quota) versus the amount actually available (the allocatable).
Changes to this metric do not occur frequently; fluctuations typically only happen when new projects are introduced. This makes the metric suitable for long-term projections of the needed cluster capacity. Organizations with OpenShift deployments where nodes cannot be scaled quickly (non-cloud deployments, such as bare metal) should most likely use this metric to decide when to scale up nodes.
Depending on your tolerance for risk, it can be acceptable for this metric to be above 100%, which signals that you have overcommitted the cluster.
Interpretation of the Request/Allocatable ratio
This metric is calculated with the following formula:

Request/Allocatable ratio = sum(pod resource requests over the node group) / sum(node Allocatable over the node group)
This metric can be interpreted as the amount tenants estimate they will need (recall from the third post of this series, on developing a capacity management process, that resource requests on containers should correspond to the estimated amount needed at runtime) versus the amount actually available.
This metric is more volatile than the previous one because its value changes as pods are added and removed, and it is more suitable for making a scaling decision when a new node can be provisioned quickly, as is typically the case in cloud environments.
The OpenShift 4.x cluster autoscaler uses this metric indirectly. In fact, it triggers the addition of a new node when a pod is stuck in the pending state because it cannot be scheduled due to a lack of resources. This is approximately the same as triggering a scale-up when this metric is close to 100%.
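For reference, in OpenShift 4.x the autoscaler is scoped to individual machine sets via MachineAutoscaler objects. A minimal sketch looks like this (the machine set name and replica bounds are hypothetical):

apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: worker-group1
  namespace: openshift-machine-api
spec:
  # Lower and upper bounds for the number of machines in the
  # targeted machine set.
  minReplicas: 2
  maxReplicas: 8
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: mycluster-worker-group1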
With an approach based on node groups, however, we can be more flexible than the cluster autoscaler because we can decide at which threshold we want to scale. For example, with this metric we can be proactive and add a new node when the ratio hits 80%, so that no pod has to wait before it can be scheduled.
Interpretation of the Usage/Allocatable ratio
This metric is calculated with the following formula:

Usage/Allocatable ratio = sum(actual resource usage over the node group) / sum(node Allocatable over the node group)
This metric can be interpreted as what is currently being used versus the amount of available resources. This is clearly the most volatile of the three, as it depends on the instant-by-instant load of the running pods.
This metric is generally poor for making capacity forecasts because its value fluctuates too often. However, it can be used to provide a general overview of the cluster.
The primary function of this metric is to let us compare actuals (what we are using) with estimates (the sum of requests from the previous ratio). If these two measures diverge, it means that our tenants are not estimating their resources correctly, and corrective action is needed.
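A hedged sketch of how such a divergence check could be automated, as another rule appended to the PrometheusRule shown earlier: the rule below fires when actual memory usage across a node group stays under half of what was requested, suggesting systematic over-estimation by tenants. The cAdvisor series name, the node label on it, and the 50% threshold are all assumptions to adapt:

    - alert: NodeGroupRequestsOverestimated
      # Fires when pods in a node group have been using less than
      # 50% of the memory they requested for a full day, which
      # suggests tenants are systematically over-estimating.
      expr: |
        sum by (label_nodegroup) (
            container_memory_working_set_bytes{container!=""}
          * on (node) group_left (label_nodegroup) kube_node_labels
        )
        /
        sum by (label_nodegroup) (
            kube_pod_container_resource_requests{resource="memory"}
          * on (node) group_left (label_nodegroup) kube_node_labels
        ) < 0.5
      for: 1d
      labels:
        severity: info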
Another function of this metric is that it allows us to judge whether we have enough resources for the current workload. However, we have to be cognizant that having enough resources at the node group level does not by itself guarantee that individual nodes are not resource constrained. For that reason, another set of metrics and alerts needs to be utilized. Fortunately, these node-level metrics and alerts are included in the Prometheus setup that comes with OpenShift.
Assumptions
In order for the dashboard to be effective and accurate, there are several assumptions that must be considered:
Nodes are grouped into non-overlapping groups. Groups are identified by a node label; the default label key is nodegroup, and each node must belong to exactly one group. The capacity management dashboard will tell us if we need to add a node to a specific group. Groups can be used to manage zones or areas of the cluster that house different workloads and may need to scale independently. For example, a cluster might have high-priority workload nodes, normal-priority workload nodes, PCI-dedicated nodes, GPU-enabled nodes, and so on. In OpenShift 3.x, there are always at least three groups: masters, infranodes, and worker nodes (and worker nodes can obviously be fragmented further if needed). In OpenShift 4.x, there is no requirement that infranodes exist, but the same concepts apply.
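For example, a worker node assigned to a hypothetical group named group1 would carry the label below (in practice, you would typically apply it with oc label node or through machine set templates):

kind: Node
apiVersion: v1
metadata:
  name: worker-0.example.com
  labels:
    nodegroup: group1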
ClusterResourceQuotas are defined to limit tenants’ resource consumption. ClusterResourceQuota is an OpenShift-specific API object that allows a quota to be defined across multiple projects, as opposed to ResourceQuota, which is associated with just one project. The ability of a ClusterResourceQuota to span several projects lets the administrator choose the granularity at which quotas are applied, achieving higher flexibility. For example, an organization can choose to grant quotas at the application level (an application usually exists on multiple projects to support the environments defined by an SDLC), at the business capability level (business capabilities are usually provided by a set of applications), or, finally, at the line-of-business level (a set of business capabilities). As a result, ClusterResourceQuotas also allow for a flexible showback/chargeback model, should a company decide to enact those processes.
ClusterResourceQuotas must be defined for memory and CPU requests. At the moment, these are the only monitored resources. Other resources can be put under quota, but they will be ignored by the dashboard.
Each ClusterResourceQuota refers to only one node group. For the dashboard to function properly, each ClusterResourceQuota must have the same label as the label used to determine the node group. For example:
apiVersion: quota.openshift.io/v1
kind: ClusterResourceQuota
metadata:
  name: quota1
  labels:
    nodegroup: group1
spec:
  quota:
    hard:
      requests.cpu: "2"
      requests.memory: 4Gi
  selector:
    labels:
      matchLabels:
        quota: quota1
Each tenant project refers to a ClusterResourceQuota and deploys resources to the corresponding node group. Each tenant project must be controlled by one and only one ClusterResourceQuota, and the project’s default node selector must be configured to select the nodes belonging to the node group the ClusterResourceQuota refers to. For example:

kind: Namespace
apiVersion: v1
metadata:
  name: p1q1
  labels:
    quota: quota1
  annotations:
    openshift.io/node-selector: nodegroup=group1
Non-tenant projects, such as administrative projects, do not have to be under quota. However, the dashboard shows more accurate information if every project deployed on the monitored node groups is under quota.
The recommended approach is to define node labeling at cluster setup, and to define the ClusterResourceQuotas and the projects during the application onboarding process. The application onboarding process is the set of steps a development team must go through to be able to operate an application on OpenShift; most organizations have a formalized process detailing these steps.
The pseudo entity-relationship diagram below represents the configuration one would attain by the end of this preparation:

Conclusions
In this article, we introduced a capacity management dashboard that can be used as the baseline for an organization’s capacity management process.
With OpenShift 4.x and the introduction of the cluster autoscaler, the urgency of having such a dashboard may be reduced. However, the autoscaler is not always an option, and even when it is implemented, it is currently very reactive (it triggers only when a pod cannot be scheduled due to lack of capacity). As a result, this dashboard should provide value in understanding capacity management with OpenShift.
 
Source: OpenShift