Kubernetes Replication Controller, Replica Set and Deployments: Understanding replication options

As a container management tool, Kubernetes was designed to orchestrate multiple containers and handle replication, and in fact there are currently several ways to do it. In this article, we'll look at three options: Replication Controllers, Replica Sets, and Deployments.
What is Kubernetes replication for?
Before we go into how you would do replication, let's talk about why.  Typically you would want to replicate your containers (and thereby your applications) for several reasons, including:

Reliability: By running multiple instances of an application, you prevent problems if one or more fails.  This is particularly true if the system replaces any containers that fail.
Load balancing: Having multiple instances of a container enables you to easily send traffic to different instances to prevent overloading of a single instance or node. This is something that Kubernetes does out of the box, making it extremely convenient.
Scaling: When load does become too much for the number of existing instances, Kubernetes enables you to easily scale up your application, adding additional instances as needed.

Replication is appropriate for numerous use cases, including:

Microservices-based applications: In these cases, multiple small applications provide very specific functionality.
Cloud native applications: Because cloud-native applications are built on the assumption that any component can fail at any time, a replicated environment is a perfect fit for implementing them, as multiple instances are baked into the architecture.
Mobile applications: Mobile applications can often be architected so that the mobile client interacts with an isolated version of the server application.

Kubernetes has multiple ways in which you can implement replication.
Types of Kubernetes replication
In this article, we'll discuss three different forms of replication: the Replication Controller, Replica Sets, and Deployments.
Replication Controller
The Replication Controller is the original form of replication in Kubernetes.  It's being replaced by Replica Sets, but it's still in wide use, so it's worth understanding what it is and how it works.

A Replication Controller is a structure that enables you to easily create multiple pods, then make sure that that number of pods always exists. If a pod does crash, the Replication Controller replaces it.

Replication Controllers also provide other benefits, such as the ability to scale the number of pods, and to update or delete multiple pods with a single command.
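For instance, once the soaktestrc controller defined below exists, imperative commands along these lines let you resize it or remove it (and its pods) in one shot; this is a sketch using the name from the example that follows:
# kubectl scale rc soaktestrc --replicas=5
# kubectl delete rc soaktestrc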

You can create a Replication Controller with an imperative command, or declaratively, from a file.  For example, create a new file called rc.yaml and add the following text:
apiVersion: v1
kind: ReplicationController
metadata:
 name: soaktestrc
spec:
 replicas: 3
 selector:
   app: soaktestrc
 template:
   metadata:
     name: soaktestrc
     labels:
       app: soaktestrc
   spec:
     containers:
      - name: soaktestrc
        image: nickchase/soaktest
        ports:
        - containerPort: 80
Most of this structure should look familiar if you've seen a pod definition before: we've got the name of the actual Replication Controller (soaktestrc), and we're designating that we should have 3 replicas, each of which is defined by the template.  The selector defines how we know which pods belong to this Replication Controller.
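Once the controller exists (we'll create it next), that selector also gives you a convenient way to list just its pods; a command along these lines should do it:
# kubectl get pods -l app=soaktestrc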

Now tell Kubernetes to create the Replication Controller based on that file:
# kubectl create -f rc.yaml
replicationcontroller "soaktestrc" created
Let's take a look at what we have using the describe command:
# kubectl describe rc soaktestrc
Name:           soaktestrc
Namespace:      default
Image(s):       nickchase/soaktest
Selector:       app=soaktestrc
Labels:         app=soaktestrc
Replicas:       3 current / 3 desired
Pods Status:    3 Running / 0 Waiting / 0 Succeeded / 0 Failed
No volumes.
Events:
 FirstSeen     LastSeen        Count   From                            SubobjectPath   Type    Reason           Message
 ---------     --------        -----   ----                            -------------   ----    ------           -------
 1m            1m              1       {replication-controller }                       Normal  SuccessfulCreate Created pod: soaktestrc-g5snq
 1m            1m              1       {replication-controller }                       Normal  SuccessfulCreate Created pod: soaktestrc-cws05
 1m            1m              1       {replication-controller }                       Normal  SuccessfulCreate Created pod: soaktestrc-ro2bl
As you can see, we've got the Replication Controller, and there are 3 replicas of the 3 that we wanted.  All 3 of them are currently running.  You can also see the individual pods listed underneath, along with their names.  If you ask Kubernetes to show you the pods, you can see those same names show up:
# kubectl get pods
NAME               READY     STATUS    RESTARTS   AGE
soaktestrc-cws05   1/1       Running   0          3m
soaktestrc-g5snq   1/1       Running   0          3m
soaktestrc-ro2bl   1/1       Running   0          3m
Next we'll look at Replica Sets, but first let's clean up:
# kubectl delete rc soaktestrc
replicationcontroller "soaktestrc" deleted

# kubectl get pods
As you can see, when you delete the Replication Controller, you also delete all of the pods that it created.
Replica Sets
Replica Sets are a sort of hybrid, in that they are in some ways more powerful than Replication Controllers, and in others they are less powerful.

Replica Sets are declared in essentially the same way as Replication Controllers, except that they have more options for the selector.  For example, we could create a Replica Set like this:
apiVersion: extensions/v1beta1
kind: ReplicaSet
metadata:
 name: soaktestrs
spec:
 replicas: 3
 selector:
   matchLabels:
     app: soaktestrs
 template:
   metadata:
     labels:
       app: soaktestrs
        environment: dev
    spec:
      containers:
      - name: soaktestrs
        image: nickchase/soaktest
        ports:
        - containerPort: 80
In this case, it's more or less the same as when we were creating the Replication Controller, except we're using matchLabels instead of a plain label selector.  But we could just as easily have said:

spec:
 replicas: 3
 selector:
    matchExpressions:
      - {key: app, operator: In, values: [soaktestrs, soaktest]}
      - {key: tier, operator: NotIn, values: [production]}
 template:
   metadata:

In this case, we're looking at two different conditions (other operators are available as well, sketched after this list):

The app label must be soaktestrs or soaktest
The tier label (if it exists) must not be production
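In and NotIn aren't the only operators you can use in matchExpressions; set-based selectors also support Exists and DoesNotExist, which test only for the presence or absence of a key. As a quick sketch (the environment key here is purely illustrative and not part of the example above):

spec:
 replicas: 3
 selector:
    matchExpressions:
     - {key: app, operator: Exists}
     - {key: environment, operator: DoesNotExist}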

Let's go ahead and create the Replica Set and get a look at it:
# kubectl create -f replicaset.yaml
replicaset "soaktestrs" created

# kubectl describe rs soaktestrs
Name:           soaktestrs
Namespace:      default
Image(s):       nickchase/soaktest
Selector:       app in (soaktest,soaktestrs),tier notin (production)
Labels:         app=soaktestrs
Replicas:       3 current / 3 desired
Pods Status:    3 Running / 0 Waiting / 0 Succeeded / 0 Failed
No volumes.
Events:
 FirstSeen     LastSeen        Count   From                            SubobjectPath   Type    Reason           Message
 ---------     --------        -----   ----                            -------------   ----    ------           -------
 1m            1m              1       {replicaset-controller }                        Normal  SuccessfulCreate Created pod: soaktestrs-it2hf
 1m            1m              1       {replicaset-controller }                        Normal  SuccessfulCreate Created pod: soaktestrs-kimmm
 1m            1m              1       {replicaset-controller }                        Normal  SuccessfulCreate Created pod: soaktestrs-8i4ra

# kubectl get pods
NAME               READY     STATUS    RESTARTS   AGE
soaktestrs-8i4ra   1/1       Running   0          1m
soaktestrs-it2hf   1/1       Running   0          1m
soaktestrs-kimmm   1/1       Running   0          1m
As you can see, the output is pretty much the same as for a Replication Controller (except for the selector), and for most intents and purposes, they are similar.  The major difference is that the rolling-update command works with Replication Controllers, but won't work with a Replica Set.  This is because Replica Sets are meant to be used as the backend for Deployments.
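For the record, a rolling update of a Replication Controller looks something like the command below; the v2 tag is purely hypothetical here, since we only have one version of the image:
# kubectl rolling-update soaktestrc --image=nickchase/soaktest:v2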

Let's clean up before we move on.
# kubectl delete rs soaktestrs
replicaset "soaktestrs" deleted

# kubectl get pods
Again, the pods that were created are deleted when we delete the Replica Set.
Deployments
Deployments are intended to replace Replication Controllers.  They provide the same replication functions (through Replica Sets) and also the ability to roll out changes and roll them back if necessary.
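For example, once the soaktest Deployment we're about to create exists, commands along these lines let you watch a rollout, review its history, and roll back to the previous revision:
# kubectl rollout status deployment/soaktest
# kubectl rollout history deployment/soaktest
# kubectl rollout undo deployment/soaktest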

Let's create a simple Deployment using the same image we've been using.  First create a new file, deployment.yaml, and add the following:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
 name: soaktest
spec:
 replicas: 5
 template:
   metadata:
     labels:
       app: soaktest
   spec:
     containers:
      - name: soaktest
        image: nickchase/soaktest
        ports:
        - containerPort: 80
Now go ahead and create the Deployment:
# kubectl create -f deployment.yaml
deployment "soaktest" created
Now let's go ahead and describe the Deployment:
# kubectl describe deployment soaktest
Name:                   soaktest
Namespace:              default
CreationTimestamp:      Sun, 05 Mar 2017 16:21:19 +0000
Labels:                 app=soaktest
Selector:               app=soaktest
Replicas:               5 updated | 5 total | 5 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  1 max unavailable, 1 max surge
OldReplicaSets:         <none>
NewReplicaSet:          soaktest-3914185155 (5/5 replicas created)
Events:
 FirstSeen     LastSeen        Count   From                            SubobjectPath   Type    Reason             Message
 ---------     --------        -----   ----                            -------------   ----    ------             -------
 38s           38s             1       {deployment-controller }                        Normal  ScalingReplicaSet  Scaled up replica set soaktest-3914185155 to 3
 36s           36s             1       {deployment-controller }                        Normal  ScalingReplicaSet  Scaled up replica set soaktest-3914185155 to 5
As you can see, rather than listing the individual pods, Kubernetes shows us the Replica Set.  Notice that the name of the Replica Set is the Deployment name and a hash value.
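If you want to see the Replica Set object itself rather than the Deployment, something like this should list it (the exact hash will differ in your cluster):
# kubectl get rs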

A complete discussion of updates is out of scope for this article (we'll cover it in the future), but a couple of interesting things are worth noting here:

The StrategyType is RollingUpdate. This value can also be set to Recreate.
By default we have a minReadySeconds value of 0; we can change that value if we want pods to be up and running for a certain amount of time (say, to load resources) before they're truly considered ready.
The RollingUpdateStrategy shows that we have a limit of 1 maxUnavailable, meaning that when we're updating the Deployment, we can have up to 1 missing pod before it's replaced, and 1 maxSurge, meaning we can have one extra pod as we scale the new pods back up. (A sketch of setting these values in the spec follows this list.)
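As a sketch of how you might set these values yourself, the Deployment spec accepts them like this; the numbers here are illustrative, not the ones used elsewhere in this article:

spec:
 replicas: 5
 minReadySeconds: 10
 strategy:
   type: RollingUpdate
   rollingUpdate:
     maxUnavailable: 1
     maxSurge: 1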

As you can see, the Deployment is backed, in this case, by Replica Set soaktest-3914185155. If we go ahead and look at the list of actual pods…
# kubectl get pods
NAME                        READY     STATUS    RESTARTS   AGE
soaktest-3914185155-7gyja   1/1       Running   0          2m
soaktest-3914185155-lrm20   1/1       Running   0          2m
soaktest-3914185155-o28px   1/1       Running   0          2m
soaktest-3914185155-ojzn8   1/1       Running   0          2m
soaktest-3914185155-r2pt7   1/1       Running   0          2m
… you can see that their names consist of the Replica Set name and an additional identifier.
Passing environment information: identifying a specific pod
Before we look at the different ways that we can affect replicas, let's set up our deployment so that we can see which pod we're actually hitting with a particular request.  To do that, we'll take advantage of the fact that the image we've been using displays the pod name in its output:
<?php
$limit = $_GET['limit'];
if (!isset($limit)) $limit = 250;
for ($i = 0; $i < $limit; $i++){
    $d = tan(atan(tan(atan(tan(atan(tan(atan(tan(atan(123456789.123456789))))))))));
}
echo "Pod ".$_SERVER['POD_NAME']." has finished!\n";
?>
As you can see, we're displaying an environment variable, POD_NAME.  Since each container is essentially its own server, this will display the name of the pod when we execute the PHP.

Now we just have to pass that information to the pod.

We do that through the use of the Kubernetes Downward API, which lets us pass environment variables into the containers:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
 name: soaktest
spec:
 replicas: 3
 template:
   metadata:
     labels:
       app: soaktest
   spec:
     containers:
      - name: soaktest
        image: nickchase/soaktest
        ports:
        - containerPort: 80
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
As you can see, we're passing an environment variable and assigning it a value from the pod's metadata.  (You can find more information on metadata here.)
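POD_NAME isn't the only field you can surface this way; the Downward API also exposes values such as the pod's namespace and IP address. As a sketch, two additional entries you could add under env (the variable names here are our own choice):

        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP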

So let's go ahead and clean up the Deployment we created earlier…
# kubectl delete deployment soaktest
deployment "soaktest" deleted

# kubectl get pods
… and recreate it with the new definition:
# kubectl create -f deployment.yaml
deployment "soaktest" created
Next let's go ahead and expose the pods to outside network requests so we can call the nginx server that is inside the containers:
# kubectl expose deployment soaktest --port=80 --target-port=80 --type=NodePort
service "soaktest" exposed
Now let's describe the services we just created so we can find out what port the Deployment is listening on:
# kubectl describe services soaktest
Name:                   soaktest
Namespace:              default
Labels:                 app=soaktest
Selector:               app=soaktest
Type:                   NodePort
IP:                     11.1.32.105
Port:                   <unset> 80/TCP
NodePort:               <unset> 30800/TCP
Endpoints:              10.200.18.2:80,10.200.18.3:80,10.200.18.4:80 + 2 more…
Session Affinity:       None
No events.
As you can see, the NodePort is 30800 in this case; in your case it will be different, so make sure to check.  That means that each of the servers involved is listening on port 30800, and requests are being forwarded to port 80 of the containers.  That means we can call the PHP script with:
http://[HOST_NAME OR HOST_IP]:[PROVIDED PORT]
In my case, I've mapped the IPs of my Kubernetes hosts to hostnames to make my life easier, and the PHP file is the default for nginx, so I can simply call:
# curl http://kube-2:30800
Pod soaktest-3869910569-xnfme has finished!
So as you can see, this time the request was served by pod soaktest-3869910569-xnfme.
Recovering from crashes: Creating a fixed number of replicas
Now that we know everything is running, let's take a look at some replication use cases.

The first thing we think of when it comes to replication is recovering from crashes. If there are 5 (or 50, or 500) copies of an application running, and one or more crashes, it's not a catastrophe.  Kubernetes improves the situation further by ensuring that if a pod goes down, it's replaced.

Let's see this in action.  Start by refreshing our memory about the pods we've got running:
# kubectl get pods
NAME                        READY     STATUS    RESTARTS   AGE
soaktest-3869910569-qqwqc   1/1       Running   0          11m
soaktest-3869910569-qu8k7   1/1       Running   0          11m
soaktest-3869910569-uzjxu   1/1       Running   0          11m
soaktest-3869910569-x6vmp   1/1       Running   0          11m
soaktest-3869910569-xnfme   1/1       Running   0          11m
If we repeatedly call the Deployment, we can see that we get different pods on a random basis:
# curl http://kube-2:30800
Pod soaktest-3869910569-xnfme has finished!
# curl http://kube-2:30800
Pod soaktest-3869910569-x6vmp has finished!
# curl http://kube-2:30800
Pod soaktest-3869910569-uzjxu has finished!
# curl http://kube-2:30800
Pod soaktest-3869910569-x6vmp has finished!
# curl http://kube-2:30800
Pod soaktest-3869910569-uzjxu has finished!
# curl http://kube-2:30800
Pod soaktest-3869910569-qu8k7 has finished!
To simulate a pod crashing, let's go ahead and delete one:
# kubectl delete pod soaktest-3869910569-x6vmp
pod "soaktest-3869910569-x6vmp" deleted

# kubectl get pods
NAME                        READY     STATUS    RESTARTS   AGE
soaktest-3869910569-516kx   1/1       Running   0          18s
soaktest-3869910569-qqwqc   1/1       Running   0          27m
soaktest-3869910569-qu8k7   1/1       Running   0          27m
soaktest-3869910569-uzjxu   1/1       Running   0          27m
soaktest-3869910569-xnfme   1/1       Running   0          27m
As you can see, pod *x6vmp is gone, and it's been replaced by *516kx.  (You can easily find the new pod by looking at the AGE column.)

If we once again call the Deployment, we can (eventually) see the new pod:
# curl http://kube-2:30800
Pod soaktest-3869910569-516kx has finished!
Now let's look at changing the number of pods.
Scaling up or down: Manually changing the number of replicas
One common task is to scale up a Deployment in response to additional load. Kubernetes has autoscaling, but we'll talk about that in another article.  For now, let's look at how to do this task manually.

The most straightforward way is to simply use the scale command:
# kubectl scale --replicas=7 deployment/soaktest
deployment "soaktest" scaled

# kubectl get pods
NAME                        READY     STATUS    RESTARTS   AGE
soaktest-3869910569-2w8i6   1/1       Running   0          6s
soaktest-3869910569-516kx   1/1       Running   0          11m
soaktest-3869910569-qqwqc   1/1       Running   0          39m
soaktest-3869910569-qu8k7   1/1       Running   0          39m
soaktest-3869910569-uzjxu   1/1       Running   0          39m
soaktest-3869910569-xnfme   1/1       Running   0          39m
soaktest-3869910569-z4rx9   1/1       Running   0          6s
In this case, we specify a new number of replicas, and Kubernetes adds enough to bring it to the desired level, as you can see.
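If you prefer to keep the replica count in version control, you could instead edit the replicas value in deployment.yaml and re-apply the file; something along these lines should have the same effect:
# kubectl apply -f deployment.yaml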

One thing to keep in mind is that Kubernetes isn't going to scale the Deployment down to be below the level at which you first started it up.  For example, if we try to scale back down to 4…
# kubectl scale --replicas=4 -f deployment.yaml
deployment "soaktest" scaled

# kubectl get pods
NAME                        READY     STATUS    RESTARTS   AGE
soaktest-3869910569-l5wx8   1/1       Running   0          11s
soaktest-3869910569-qqwqc   1/1       Running   0          40m
soaktest-3869910569-qu8k7   1/1       Running   0          40m
soaktest-3869910569-uzjxu   1/1       Running   0          40m
soaktest-3869910569-xnfme   1/1       Running   0          40m
… Kubernetes only brings us back down to 5, because that's what was specified by the original deployment.
Deploying a new version: Replacing replicas by changing their label
Another way you can use Deployments is to make use of the selector.  In other words, if a Deployment controls all the pods with a tier value of dev, changing a pod's tier label to prod will remove it from the Deployment's sphere of influence.

This mechanism enables you to selectively replace individual pods. For example, you might move pods from a dev environment to a production environment, or you might do a manual rolling update, updating the image, then removing some fraction of pods from the Deployment; when they're replaced, it will be with the new image. If you're happy with the changes, you can then replace the rest of the pods.

Let's see this in action.  As you recall, this is our Deployment:
# kubectl describe deployment soaktest
Name:                   soaktest
Namespace:              default
CreationTimestamp:      Sun, 05 Mar 2017 19:31:04 +0000
Labels:                 app=soaktest
Selector:               app=soaktest
Replicas:               3 updated | 3 total | 3 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  1 max unavailable, 1 max surge
OldReplicaSets:         <none>
NewReplicaSet:          soaktest-3869910569 (3/3 replicas created)
Events:
 FirstSeen     LastSeen        Count   From                            SubobjectPath   Type    Reason             Message
 ---------     --------        -----   ----                            -------------   ----    ------             -------
 50s           50s             1       {deployment-controller }                        Normal  ScalingReplicaSet  Scaled up replica set soaktest-3869910569 to 3
And these are our pods:
# kubectl describe replicaset soaktest-3869910569
Name:           soaktest-3869910569
Namespace:      default
Image(s):       nickchase/soaktest
Selector:       app=soaktest,pod-template-hash=3869910569
Labels:         app=soaktest
               pod-template-hash=3869910569
Replicas:       5 current / 5 desired
Pods Status:    5 Running / 0 Waiting / 0 Succeeded / 0 Failed
No volumes.
Events:
 FirstSeen     LastSeen        Count   From                            SubobjectPath   Type    Reason            Message
 ---------     --------        -----   ----                            -------------   ----    ------            -------
 2m            2m              1       {replicaset-controller }                        Normal  SuccessfulCreate  Created pod: soaktest-3869910569-0577c
 2m            2m              1       {replicaset-controller }                        Normal  SuccessfulCreate  Created pod: soaktest-3869910569-wje85
 2m            2m              1       {replicaset-controller }                        Normal  SuccessfulCreate  Created pod: soaktest-3869910569-xuhwl
 1m            1m              1       {replicaset-controller }                        Normal  SuccessfulCreate  Created pod: soaktest-3869910569-8cbo2
 1m            1m              1       {replicaset-controller }                        Normal  SuccessfulCreate  Created pod: soaktest-3869910569-pwlm4
We can also get a list of pods by label:
# kubectl get pods -l app=soaktest
NAME                        READY     STATUS    RESTARTS   AGE
soaktest-3869910569-0577c   1/1       Running   0          7m
soaktest-3869910569-8cbo2   1/1       Running   0          6m
soaktest-3869910569-pwlm4   1/1       Running   0          6m
soaktest-3869910569-wje85   1/1       Running   0          7m
soaktest-3869910569-xuhwl   1/1       Running   0          7m
So those are our original soaktest pods; what if we wanted to add a new label?  We can do that on the command line:
# kubectl label pods soaktest-3869910569-xuhwl experimental=true
pod "soaktest-3869910569-xuhwl" labeled

# kubectl get pods -l experimental=true
NAME                        READY     STATUS    RESTARTS   AGE
soaktest-3869910569-xuhwl   1/1       Running   0          14m
So now we have one experimental pod.  But since the experimental label has nothing to do with the selector for the Deployment, it doesn't affect anything.

So what if we change the value of the app label, which the Deployment is looking at?
# kubectl label pods soaktest-3869910569-wje85 app=notsoaktest --overwrite
pod "soaktest-3869910569-wje85" labeled
In this case, we need to use the --overwrite flag because the app label already exists. Now let's look at the existing pods.
# kubectl get pods
NAME                        READY     STATUS    RESTARTS   AGE
soaktest-3869910569-0577c   1/1       Running   0          17m
soaktest-3869910569-4cedq   1/1       Running   0          4s
soaktest-3869910569-8cbo2   1/1       Running   0          16m
soaktest-3869910569-pwlm4   1/1       Running   0          16m
soaktest-3869910569-wje85   1/1       Running   0          17m
soaktest-3869910569-xuhwl   1/1       Running   0          17m
As you can see, we now have six pods instead of five, with a new pod having been created to replace *wje85, which was removed from the Deployment. We can see the changes by requesting pods by label:
# kubectl get pods -l app=soaktest
NAME                        READY     STATUS    RESTARTS   AGE
soaktest-3869910569-0577c   1/1       Running   0          17m
soaktest-3869910569-4cedq   1/1       Running   0          20s
soaktest-3869910569-8cbo2   1/1       Running   0          16m
soaktest-3869910569-pwlm4   1/1       Running   0          16m
soaktest-3869910569-xuhwl   1/1       Running   0          17m
Now, there is one wrinkle that you have to take into account: because we've removed this pod from the Deployment, the Deployment no longer manages it.  So if we were to delete the Deployment…
# kubectl delete deployment soaktest
deployment "soaktest" deleted
The pod remains:
# kubectl get pods
NAME                        READY     STATUS    RESTARTS   AGE
soaktest-3869910569-wje85   1/1       Running   0          19m
You can also easily replace all of the pods in a Deployment using the --all flag, as in:
# kubectl label pods --all app=notsoaktesteither --overwrite
But remember that you'll have to delete them all manually!
Conclusion
Replication is a large part of Kubernetes' purpose in life, so it's no surprise that we've just scratched the surface of what it can do, and how to use it. It is useful for reliability purposes, for scalability, and even as a basis for your architecture.

What do you anticipate using replication for, and what would you like to know more about? Let us know in the comments!

InfraKit and Docker Swarm Mode: A Fault-Tolerant and Self-Healing Cluster

Back in October 2016, Docker released InfraKit, an open source toolkit for creating and managing declarative, self-healing infrastructure. This is the second in a two part series that dives more deeply into the internals of InfraKit.
Introduction
In the first installment of this two part series about the internals of InfraKit, we presented InfraKit’s design, architecture, and approach to high availability.  We also discussed how it can be combined with other systems to give distributed computing clusters self-healing and self-managing properties. In this installment, we present an example of leveraging Docker Engine in Swarm Mode to achieve high availability for InfraKit, which in turn enhances the Docker Swarm cluster by making it self-healing.  
Docker Swarm Mode and InfraKit
One of the key architectural features of Docker in Swarm Mode is the manager quorum powered by SwarmKit.  The manager quorum stores information about the cluster, and the consistency of information is achieved through consensus via the Raft consensus algorithm, which is also at the heart of other systems like Etcd. This guide gives an overview of the architecture of Docker Swarm Mode and how the manager quorum maintains the state of the cluster.
One aspect of the cluster state maintained by the quorum is node membership: which nodes are in the cluster, which are managers and which are workers, and their statuses. The Raft consensus algorithm gives us guarantees about our cluster’s behavior in the face of failure, and fault tolerance of the cluster is related to the number of manager nodes in the quorum. For example, a Docker Swarm with three managers can tolerate one node outage, planned or unplanned, while a quorum of five managers can tolerate outages of up to two members, possibly one planned and one unplanned.
The Raft quorum makes the Docker Swarm cluster fault tolerant; however, it cannot fix itself.  When the quorum experiences outage of manager nodes, manual steps are needed to troubleshoot and restore the cluster.  These procedures require the operator to update or restore the quorum’s topology by demoting and removing old nodes from the quorum and joining new manager nodes when replacements are brought online.  
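For reference, the manual recovery looks roughly like the commands below, run from a healthy manager; the node ID is a placeholder, and removing an unreachable node may require the --force flag:

docker node demote <failed-manager-id>
docker node rm <failed-manager-id>
docker swarm join-token manager

The last command prints the docker swarm join invocation to run on the replacement instance so it can join as a new manager.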
While these administration tasks are easy via the Docker command line interface, InfraKit can automate this and make the cluster self-healing.  As described in our last post, InfraKit can be deployed in a highly available manner, with multiple replicas running and only one active master.  In this configuration, the InfraKit replicas can accept external input to determine which replica is the active master.  This makes it easy to integrate InfraKit with Docker in Swarm Mode: by running InfraKit on each manager node of the Swarm and by detecting the leadership changes in the Raft quorum via standard Docker API, InfraKit achieves the same fault-tolerance as the Swarm cluster. In turn, InfraKit’s monitoring and infrastructure orchestration capabilities, when there’s an outage, can automatically restore the quorum, making the cluster self-healing.
Example: A Docker Swarm with InfraKit on AWS
To illustrate this idea, we created a Cloudformation template that will bootstrap and create a cluster of Docker in Swarm Mode managed by InfraKit on AWS.  There are a couple of ways to run this: you can clone the InfraKit examples repo and upload the template, or you can use this URL to launch the stack in the Cloudformation console.
Please note that this Cloudformation script is for demonstrations only and may not represent best practices.  However, technical users should experiment and customize it to suit their purposes.  A few things about this Cloudformation template:

As a demo, only a few regions are supported: us-west-1 (Northern California), us-west-2 (Oregon), us-east-1 (Northern Virginia), and eu-central-1 (Frankfurt).
It takes the cluster size (number of nodes), SSH key, and instance sizes as the primary user input when launching the stack.
There are options for installing the latest Docker Engine on a base Ubuntu 16.04 AMI, or for using images we have published for this demonstration with Docker pre-installed.
It bootstraps the networking environment by creating a VPC, a gateway and routes, a subnet, and a security group.
It creates an IAM role for InfraKit’s AWS instance plugin to describe and create EC2 instances.
It creates a single bootstrap EC2 instance and three EBS volumes (more on this later).  The bootstrap instance is attached to one of the volumes and will be the first leader of the Swarm.  The entire Swarm cluster will grow from this seed, as driven by InfraKit.

With the elements above, this Cloudformation script has everything needed to boot up an InfraKit-managed Docker in Swarm Mode cluster of N nodes (with 3 managers and N-3 workers).
About EBS Volumes and Auto-Scaling Groups
The use of EBS volumes in our example demonstrates an alternative approach to managing Docker Swarm Mode managers.  Instead of relying on manually updating the quorum topology by removing and then adding new manager nodes to replace crashed instances, we use EBS volumes attached to the manager instances and mounted at /var/lib/docker for durable state that survives past the life of an instance.  As soon as the volume of a terminated manager node is attached to a new replacement EC2 instance, we can carry the cluster state forward quickly because there are far fewer state changes to catch up on.  This approach is attractive for large clusters running many nodes and services, where the entirety of the cluster state may take a long time to replicate to a brand new manager that just joined the Swarm.
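To make the mechanics concrete, re-pairing a volume with a replacement instance boils down to something like the following sketch using the AWS CLI (InfraKit performs the equivalent through its AWS instance plugin; the volume ID, instance ID, and device name here are placeholders):
aws ec2 attach-volume --volume-id vol-0abc123 --instance-id i-0def456 --device /dev/xvdf
Then, on the instance, once the device appears:
systemctl stop docker
mount /dev/xvdf /var/lib/docker
systemctl start docker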
The use of persistent volumes in this example highlights InfraKit’s philosophy of running stateful services on immutable infrastructure:

Use compute instances for just the processing cores;  they can come and go.
Keep state on persistent volumes that can survive when compute instances don’t.
The orchestrator has the responsibility to maintain members in a group identified by fixed logical IDs.  In this case these are the private IP addresses of the Swarm managers.
The pairing of logical ID (IP address) and state (on volume) needs to be maintained.

This brings up a related implementation detail: why not use the auto-scaling group implementations that already exist?  First, auto-scaling group implementations vary from one cloud provider to the next, if they are available at all.  Second, most auto-scalers are designed to manage cattle, where individual instances in a group are identical to one another.  This is clearly not the case for the Swarm managers:

The managers have some kind of identity as resources (via IP addresses)
As infrastructure resources, members of a group know about each other via membership in this stable set of IDs.
The managers identified by these IP addresses have state that needs to be detached and reattached across instance lifetimes.  The pairing must be maintained.

Current auto-scaling group implementations focus on managing identical instances in a group.  New instances are launched with assigned IP addresses that don't match the expectations of the group, and volumes from failed instances in an auto-scaling group don't carry over to the new instance.  It is possible to work around these limitations with sweat and conviction; InfraKit, through its support of allocation, logical IDs, and attachments, supports this use case natively.
Bootstrapping InfraKit and the Swarm
So far, the Cloudformation template implements what we called ‘bootstrapping’, or the process of creating the minimal set of resources to jumpstart an InfraKit managed cluster.  With the creation of the networking environment and the first “seed” EC2 instance, InfraKit has the requisite resources to take over and complete provisioning of the cluster to match the user’s specification of N nodes (with 3 managers and N-3 workers).   Here is an outline of the process:
When the single “seed” EC2 instance boots up, a single line of code is executed in the UserData (aka cloudinit), in Cloudformation JSON:
"docker run --rm ",{"Ref":"InfrakitCore"}," infrakit template --url ",
{"Ref":"InfrakitConfigRoot"}, "/boot.sh",
" --global /cluster/name=", {"Ref":"AWS::StackName"},
" --global /cluster/swarm/size=", {"Ref":"ClusterSize"},
" --global /provider/image/hasDocker=yes",
" --global /infrakit/config/root=", {"Ref":"InfrakitConfigRoot"},
" --global /infrakit/docker/image=", {"Ref":"InfrakitCore"},
" --global /infrakit/instance/docker/image=", {"Ref":"InfrakitInstancePlugin"},
" --global /infrakit/metadata/docker/image=", {"Ref":"InfrakitMetadataPlugin"},
" --global /infrakit/metadata/configURL=", {"Ref":"MetadataExportTemplate"},
" | tee /var/lib/infrakit.boot | sh \n"
Here, we are running InfraKit packaged in a Docker image, and most of this Cloudformation statement references the Parameters (e.g. "InfrakitCore" and "ClusterSize") defined at the beginning of the template.  Using the parameter values in the stack template, this translates to a single statement like the following, which executes during bootup of the instance:
docker run --rm infrakit/devbundle:0.4.1 infrakit template
--url https://infrakit.github.io/examples/swarm/boot.sh
--global /cluster/name=mystack
--global /cluster/swarm/size=4 # many more ...
| tee /var/lib/infrakit.boot | sh # tee just makes a copy on disk

This single statement marks the hand-off from Cloudformation to InfraKit.  When the seed instance starts up (and installs Docker, if it's not already part of the AMI), the InfraKit container is run to execute the InfraKit template command.  The template command takes a URL as the source of the template (e.g. https://infrakit.github.io/examples/swarm/boot.sh, or a local file with a URL like file://) and a set of pre-conditions (the --global variables), and renders it.  Through the --global flags, we are able to pass in the set of parameters entered by the user when launching the Cloudformation stack. This allows InfraKit to use Cloudformation as the authentication and user interface for configuring the cluster.
InfraKit uses templates to simplify complex scripting and configuration tasks.  The templates can be any text that uses {{ }} tags, aka "handlebar" syntax.  Here InfraKit is given a set of input parameters from the Cloudformation template and a URL referencing the boot script.  It then fetches the template and renders a script that is executed during boot-up of the instance to perform the following:
 

Format the EBS volume if it's not already formatted
Stop Docker if it's currently running, and mount the volume at /var/lib/docker
Configure the Docker engine with the proper labels, and restart it
Start an InfraKit metadata plugin that can introspect its environment.  The AWS instance plugin, in v0.4.1, can introspect an environment formed by Cloudformation, as well as use the instance metadata service available on AWS.  InfraKit metadata plugins can export important parameters in a read-only namespace that can be referenced in templates as file-system paths.
Start the InfraKit containers such as the manager, group, instance, and Swarm flavor plugins
Initialize the Swarm via docker swarm init
Generate a config JSON for InfraKit itself.  This JSON is also rendered from a template (https://github.com/infrakit/examples/blob/v0.4.1/swarm/groups.json) that references environmental parameters like region, availability zone, subnet IDs, and security group IDs that are exported by the metadata plugins.
Perform an infrakit manager commit to tell InfraKit to begin managing the cluster

See https://github.com/infrakit/examples/blob/v0.4.1/swarm/boot.sh for details.
When the InfraKit replica begins running, it notices that the current infrastructure state (of only one node) does not match the user's specification of 3 managers and N-3 worker nodes.  InfraKit will then drive the infrastructure state toward the user's specification by creating the rest of the managers and workers to complete the Swarm.
The topic of metadata and templating in InfraKit will be the subject of future blog posts.  In a nutshell, metadata is information exposed by compatible plugins, organized and accessible in a cluster-wide namespace.  Metadata can be accessed in the InfraKit CLI or in templates with file-like path names.  You can think of this as a cluster-wide read-only sysfs.  The InfraKit template engine, in turn, can make use of this data to render complex configuration scripts or JSON documents. The template engine supports fetching a collection of templates from a local directory or from a remote site, like the example GitHub repo that has been configured to serve up the templates like a static website or S3 bucket.
 
Running the Example
You can either fork the examples repo or use this URL to launch the stack in the AWS console.  Here we first bootstrap the Swarm with the Cloudformation template, then InfraKit takes over and provisions the rest of the cluster.  Then, we demonstrate fault tolerance and self-healing by terminating the leader manager node in the Swarm to induce a fault and force failover and recovery.
When you launch the stack, you have to answer a few questions:

The size of the cluster.  This script always starts a Swarm with 3 managers, so use a value greater than 3.

The SSH key.

There’s an option to install Docker or use an AMI with Docker pre-installed.  An AMI with Docker pre-installed gives shorter startup time when InfraKit needs to spin up a replacement instance.

Once you agree and launch the stack, it takes a few minutes for the cluster to come up.  In this case, we start a 4-node cluster.  In the AWS console we can verify that the cluster is fully provisioned by InfraKit:

Note the private IP addresses 172.31.16.101, 172.31.16.102, and 172.31.16.103 are assigned to the Swarm managers, and they are the values in our configuration. In this example the public IP addresses are dynamically assigned: 35.156.207.156 is bound to the manager instance at 172.31.16.101.  
Also, we see that InfraKit has attached the 3 EBS volumes to the manager nodes:

Because InfraKit is configured with the Swarm Flavor plugin, it also made sure that the manager and worker instances successfully joined the Swarm.  To illustrate this, we can log into the manager instances and run docker node ls. As a means to visualize the Swarm membership in real-time, we log into all three manager instances and run
watch -d docker node ls  
The watch command will by default refresh docker node ls every 2 seconds.  This allows us to not only watch the Swarm membership changes in real-time but also check the availability of the Swarm as a whole.

Note that at this time, the leader of the Swarm is just as we expected, the bootstrap instance, 172.31.16.101.  
Let’s make a note of this instance’s public IP address (35.156.207.156), private IP address (172.31.16.101), and its Swarm Node cryptographic identity (qpglaj6egxvl20vuisdbq8klr).  Now, to test fault tolerance and self-healing, let’s terminate this very leader instance.  As soon as this instance is terminated, we would expect the quorum leadership to go to a new node, and consequently, the InfraKit replica running on that node will become the new master.

Immediately the screen shows there is an outage:  In the top terminal, the connection to the remote host (172.31.16.101) is lost.  In the second and third terminals below, the Swarm node lists are being updated in real time:

When the 172.31.16.101 instance is terminated, the leadership of the quorum is transferred to another node, at IP address 172.31.16.102. Docker Swarm Mode is able to tolerate this failure and continue to function (as seen by the continued functioning of docker node ls on the remaining managers).  However, the Swarm has noticed that the 172.31.16.101 instance is now Down and Unreachable.

As configured, a quorum of 3 managers can tolerate one instance outage.  At this point, the cluster continues operation without interruption.  All your apps running on the Swarm continue to work and you can deploy services as usual.  However, without any automation, the operator needs to intervene at some point and perform tasks to restore the cluster before another outage hits the remaining nodes.
Because this cluster is managed by InfraKit, the replica running on 172.31.16.102 becomes the master when that instance assumes leadership of the quorum.  Because InfraKit is tasked with maintaining the specification of 3 manager instances with IP addresses 172.31.16.101, 172.31.16.102, and 172.31.16.103, it will take action when it notices 172.31.16.101 is missing.  In order to correct the situation, it will:

Create a new instance with the private IP address 172.31.16.101
Attach the EBS volume that was previously associated with the downed instance
Restore the volume, so that Docker Engine and InfraKit start running on that new instance.
Join the new instance to the Swarm.

As seen above, the new instance at private IP 172.31.16.101 now has an ephemeral public IP address of 35.157.163.34, whereas previously it was 35.156.207.156.  We also see that the EBS volume has been re-attached:

Because the EBS volume is re-attached at /var/lib/docker on the new instance, and the same IP address is reused, the new instance appears exactly as though the downed instance had been resurrected and rejoined the cluster.  As far as the Swarm is concerned, 172.31.16.101 may as well have been subjected to a temporary network partition and has since recovered and rejoined the cluster:

At this point, the cluster has recovered without any manual intervention.  The managers are now showing as healthy, and the quorum lives on!
Conclusion
While this example is only a proof-of-concept, we hope it demonstrates the potential of InfraKit as an active infrastructure orchestrator which can make a distributed computing cluster both fault-tolerant and self-healing.  As these features and capabilities mature and harden, we will incorporate them into Docker products such as Docker Editions for AWS and Azure.
InfraKit is a young project and rapidly evolving, and we are actively testing and building ways to safeguard and automate the operations of large distributed computing clusters.   While this project is being developed in the open, your ideas and feedback can help guide us down the path toward making distributed computing resilient and easy to operate.
Check out the InfraKit repository README for more info and a quick tutorial, and to start experimenting, from plain files to Terraform integration to building a Zookeeper ensemble. Have a look, explore, and join us on GitHub or online at the Docker Community Slack Channel (infrakit).  Send us a PR, open an issue, or just say hello.  We look forward to hearing from you!
More Resources:

Check out all the Infrastructure Plumbing projects
The InfraKit examples GitHub repo
Sign up for Docker for AWS or Docker for Azure
Try Docker today 


The post InfraKit and Docker Swarm Mode: A Fault-Tolerant and Self-Healing Cluster appeared first on Docker Blog.

What’s new in OpenStack Ocata webinar — Q&A

The post What's new in OpenStack Ocata webinar — Q&A appeared first on Mirantis | Pure Play Open Cloud.
On February 22, my colleagues Rajat Jain, Stacy Verroneau, and Michael Tillman and I held a webinar to discuss the new features in OpenStack's latest release, Ocata. Unfortunately, we ran out of time for questions and answers, so here they are.
Q: What are the benefits of using the cells capability?
Rajat: The cells concept was introduced in the Juno release, and as some of you may recall, it was meant to allow a large number of nova-compute instances to share OpenStack services.

Therefore, Cells functionality enables you to scale an OpenStack Compute cloud in a more distributed fashion without having to use complicated technologies like database and message queue clustering. It supports very large deployments.

When this functionality is enabled, the hosts in an OpenStack Compute cloud are partitioned into groups called cells. Cells are configured as a tree. The top-level cell should have a host that runs a nova-api service, but no nova-compute services. Each child cell should run all of the typical nova-* services of a regular Compute cloud except for nova-api. You can think of cells as a normal Compute deployment in that each cell has its own database server and message queue broker. These capabilities were provided by the nova-cells and nova-api services.
One of the key changes in Ocata is the upgrade to cells v2, which now relies only on the nova-api service for all synchronization across the cells.
Q: What is the placement service and how can I leverage it?
Rajat: The placement service, which was introduced in the Newton release, is now a key part of OpenStack and also mandatory in determining the optimum placement of VMs. Basically, you set up pools of resources, provide an inventory of the compute nodes, and then set up allocations for resource providers. Then you can set up policies and models for optimum placements of VMs.
Q: What is the OS profiler, and why is it useful?
Rajat: OpenStack consists of multiple projects. Each project, in turn, is composed of multiple services. To process a request (for example, to boot a virtual machine), OpenStack uses multiple services from different projects. If something in this process runs slowly, it's extremely complicated to understand what exactly went wrong and to locate the bottleneck.
To resolve this issue, a tiny but powerful library, osprofiler, was introduced. The osprofiler library will be used by all OpenStack projects and their Python clients. It provides the ability to generate one trace per request, flowing through all involved services. This trace can then be extracted and used to build a tree of calls, which can be quite handy for a variety of reasons (for example, isolating cross-project performance issues).
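As a rough illustration of the workflow (assuming profiling is enabled in the services and a shared secret key is configured; the exact flag names vary between client and osprofiler versions, so treat this as a sketch rather than exact syntax):
openstack --os-profile SECRET_KEY server list
osprofiler trace show <trace-id> --html > trace.html
The first command prints a trace ID along with its normal output; the second renders the collected trace as a call tree. Depending on your setup, you may also need to point osprofiler at the backend where traces are stored (for example, via a connection string option).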
Q: If I have Keystone connected to a backend Active Directory, will I benefit from the auto-provisioning of the federated identity?
Rajat: Yes. The federated identity mapping engine now supports the ability to automatically provision projects for federated users. A role assignment will automatically be created for the user on the specified project. Prior to this, a federated user had to attempt to authenticate before an administrator could assign roles directly to their shadowed identity, resulting in a strange user experience. This is therefore a big usability enhancement for deployers leveraging the federated identity plugins.
Q: Is FWaaS really used out there?
Stacy: Yes it is, but its viability in production is debatable, and going with a third party that provides a Neutron plugin is still, IMHO, the way to go.
Q: When is Octavia GA planned to be released?
Stacy: Octavia is forecast to be GA in the Pike release.
Q: Are DragonFlow and Tricircle ready for Production?
Stacy: Those are young Big Tent projects, but we're pretty sure we will see a big evolution in Pike.
Q: What's the codename for the placement service, please?
Stacy: It's just called the Placement API. There's no fancy name.
Q: Does Ocata continue support for Fernet tokens?
Rajat: Yes.
Q: With a federated provider, can I integrate my OpenStack environment with my on-prem AD and allow domain users to use OpenStack?
Rajat: This was always supported, and is not new to Ocata. More details at https://docs.openstack.org/admin-guide/identity-integrate-with-ldap.html
What's new in this area is that the federated identity mapping engine now supports the ability to automatically provision projects for federated users. A role assignment will automatically be created for the user on the specified project. Prior to this, a federated user had to attempt to authenticate before an administrator could assign roles directly to their shadowed identity, resulting in a strange user experience.

Q: If I'm using my existing domain users from AD in OpenStack, how would I control their rights/roles to perform specific tasks in an OpenStack project?
Rajat: You would first set up authentication via LDAP, provide connection settings for AD, and set the identity driver to ldap in keystone.conf. Next, you will have to assign roles and projects to the AD users. Since Mitaka, the only option you can use for the assignment in keystone.conf is the SQL driver, but you will have to do the mapping. Most users prefer this approach anyway, as they want to keep AD read-only from the OpenStack connection. You can find more details on how to configure Keystone with LDAP here.
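For example, the relevant keystone.conf sections look roughly like this (a hedged sketch; the hostname, bind account, and suffix are placeholders, and a read-only AD setup keeps assignment in SQL as described above):
[identity]
driver = ldap

[ldap]
url = ldap://ad.example.com
user = CN=openstack,OU=ServiceAccounts,DC=example,DC=com
password = secret
suffix = DC=example,DC=com
user_objectclass = person

[assignment]
driver = sql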
Q: What, if anything, was pushed out of the "big tent" and/or did not get robustly worked on?
Nick:  You can get a complete view of work done on every project at Stackalytics.
Q: So when is Tricircle being released for use in production?
Stacy: Not soon enough.  Being a new Big Tent project, it needs some time to develop traction.  
Q: Do we support creation of SR-IOV ports from Horizon during instance creation? If not, are there any plans there?
Nick: According to the Horizon team, you can pre-create the port and assign it to an instance.
Q: Way to go, warp speed Michael! Good job, Rajat and Stacy. Don't worry about getting behind, I blame Nick anyway. Then again, I always blame Nick.
Nick: Thanks Ben, I appreciate you, too.

How to avoid getting clobbered when your cloud host goes down

The post How to avoid getting clobbered when your cloud host goes down appeared first on Mirantis | Pure Play Open Cloud.
Yesterday, while working on an upcoming tutorial, I was suddenly reminded how interconnected the web really is. Everything was humming along nicely, until I tried to push changes to a very large repository. That's when everything came to a screeching halt.
"No problem," I thought.  "Everybody has glitches once in a while."  So I decided I'd work on a different piece of content, and pulled up another browser window for the project management system we use to get the URL. The servers, I was told, were "receiving some TLC."
OK, what about that mailing list task I was going to take care of?  Nope, that was down too.
As you probably know by now, all of these problems were due to a failure in one of Amazon Web Services' S3 storage data centers.  According to the BBC, the outage even affected sites as large as Netflix, Spotify, and AirBnB.
Now, you may think I'm writing this to gloat (after all, here at Mirantis we obviously talk a lot about OpenStack, and one of the things we often hear is "Oh, private cloud is too unreliable"), but I'm not.
The thing is, public cloud isn't any more or less reliable than private cloud; it's just that you're not the one responsible for keeping it up and running.
And therein lies the problem.
If AWS S3 goes down, there is precisely zero you can do about it. Oh, it's not that there's nothing you can do to keep your application up; that's a different matter, which we'll get to in a moment.  But there's nothing you can do to get S3 (or EC2, Google Compute Engine, or whatever public cloud service we're talking about) back up and running. Chances are you won't even know there's an issue until it starts to affect you, and your customers.
A while back my colleague Amar Kapadia compared the costs of a DIY private cloud with a vendor distribution and with a managed cloud service. In that calculation, he included the cost of downtime as part of the cost of DIY and vendor distribution-based private clouds. But really, as yesterday proved, no cloud, even one operated by the largest public cloud provider in the world, is beyond downtime. It's all in what you do about it.
So what can you do about it?
Have you heard the expression, "The best defense is a good offense"?  Well, it's true for cloud operations too. In an ideal situation, you will know exactly what's going on in your cloud at all times, and take action to solve problems BEFORE they happen. You'd want to know that the error rate for your storage is trending upwards before the data center fails, so you can troubleshoot and solve the problem. You'd want to know that a server is running slow so you can find out why and potentially replace it before it dies on you, possibly taking critical workloads with it.
And while we're at it, a true cloud application should be able to weather the storm of a dying hypervisor or even a storage failure; such applications are designed to be fault-tolerant. Pure play open cloud is about building your cloud and applications so that they're not even vulnerable to the failure of a data center.
But what does that mean?
What is Pure Play Open Cloud?
You'll be hearing a lot more about Pure Play Open Cloud in the coming months, but for the purposes of our discussion, it means the following:
Cloud-based infrastructure that's agnostic to the hardware and underlying data center (so it can run anywhere), based on open source software such as OpenStack, Kubernetes, Ceph, and networking software such as OpenContrail (so that there's no vendor lock-in, and you can move it between a hosted environment and your own), and managed as infrastructure-as-code, using CI/CD pipelines and so on, to enable reliability and scale.
Well, that's a mouthful! What does it mean in practice?
It means that the ideal situation is one in which you:

Are not dependent on a single vendor or cloud
Can react quickly to technical problems
Have visibility into the underlying cloud
Have support (and help) fixing issues before they become problems

Sounds great, but making it happen isn't always easy. Let's look at these things one at a time.
Not being dependent on a single vendor or cloud
Part of the impetus behind the development of OpenStack was the realization that while Amazon Web Services enabled a whole new way of working, it had one major flaw: complete dependence on AWS.
The problems here were both technological and financial. AWS makes a point of trying to bring prices down overall, but as you grow, incremental cost increases are going to happen; there's just no way around that. And once you've decided that you need to do something else, if your entire infrastructure is built around AWS products and APIs, you're stuck.
A better situation would be to build your infrastructure and application in such a way that it's agnostic to the hardware and underlying infrastructure. If your application doesn't care whether it's running on AWS or OpenStack, then you can create an OpenStack infrastructure that serves as the base for your application, and use external resources such as AWS or GCE for emergency scaling, or for damage control in an emergency.
Reacting quickly to technical problems
In an ideal world, nobody would have been affected by the outage in AWS S3's us-east-1 region, because their applications would have been architected with a presence in multiple regions. That's what regions are for. Rarely, however, does this happen.
Build your applications so that they have (or at the very least, CAN have) a presence in multiple locations. Ideally, they're spread out by default, so if there's a problem in one "place", the application keeps running. This redundancy can get expensive, though, so the next best thing would be to have the application detect a problem and switch over to a fail-safe or alternate region in case of emergency. At the bare minimum, you should be able to manually change over to a different option once a problem has been detected.
Preferably, this would happen before the situation becomes critical.
Having visibility into the underlying cloud
Having visibility into the underlying cloud is one area where private or managed cloud definitely has the advantage over public cloud.  After all, one of the basic tenets of cloud is that you don't necessarily care about the specific hardware running your application, which is fine, unless you're responsible for keeping it running.
In that case, using tools such as StackLight (for OpenStack) or Prometheus (for Kubernetes) can give you insight into what's going on under the covers. You can see whether a problem is brewing, and if it is, you can troubleshoot to determine whether the problem is the cloud itself, or the applications running on it.
Once you determine that you do have a problem with your cloud (as opposed to the applications running on it), you can take action immediately.
Support (and help) fixing issues before they become problems
Preventing and fixing problems is, for many people, where the rubber hits the road. With a serious shortage of cloud experts, many companies are nervous about trusting their cloud to their own internal people.
It doesn't have to be that way.
While it would seem that the least expensive way of getting into cloud is the "do it yourself" approach (after all, the software's free, right?), in the long term that's not necessarily true.
The traditional answer is to use a vendor distribution and purchase support, and that's definitely a viable option.
A second option that's becoming more common is the notion of "managed cloud."  In this situation, your cloud may or may not be on your premises, but the important part is that it's overseen by experts who know the signs to look for and are able to make sure that your cloud maintains a certain SLA, without taking away your control.
For example, Mirantis Managed OpenStack is a service that monitors your cloud 24/7 and can literally fix problems before they happen. It involves remote monitoring, a CI/CD infrastructure, KPI reporting, and even operational support, if necessary. But Mirantis Managed OpenStack is designed on the notion of Build-Operate-Transfer; everything is built on open standards, so you're not locked in. When you're ready, you can take over and transition to a lower level of support, or even take over entirely, if you want.
What matters is that you have help that keeps you running without keeping you trapped.
Taking control of your cloud destiny
The important thing here is that while it may seem easy to rely on a huge cloud vendor to do everything for you, it's not necessarily in your best interest. Take control of your cloud, and take responsibility for making sure that you have options, and more importantly, that your applications have options too.
The post How to avoid getting clobbered when your cloud host goes down appeared first on Mirantis | Pure Play Open Cloud.

Introduction to Salt and SaltStack

The post Introduction to Salt and SaltStack appeared first on Mirantis | Pure Play Open Cloud.
The amazing world of configuration management software is really well populated these days. You may already have looked at Puppet, Chef, or Ansible, but today we focus on SaltStack. Simplicity is at its core, without any compromise on speed or scalability. In fact, some deployments have 10,000 minions or more. In this article, we're going to give you a look at what Salt is and how it works.
Salt architecture
Salt remote execution is built on top of an event bus, which makes it unique. It uses a server-agent communication model where the server is called the salt master and the agents the salt minions.
Salt minions receive commands simultaneously from the master and contain everything required to execute commands locally and report back to the salt master. Communication between the master and minions happens over a high-performance data pipe that uses ZeroMQ or raw TCP, and messages are serialized using MessagePack to enable fast and light network traffic. Salt uses public keys for authentication with the master daemon, then uses faster AES encryption for payload communication.
State descriptions are written in YAML, remote execution is possible over a CLI, and programming or extending Salt isn't a must.
Salt is heavily pluggable; each function can be replaced by a plugin implemented as a Python module. For example, you can replace the data store, the file server, the authentication mechanism, and even the state representation. So when I say state representation is done using YAML, I'm talking about the Salt default, which can be replaced by JSON, Jinja, Wempy, Mako, or Py Objects. But don't freak out. Salt comes with default options for all these things, which enables you to jumpstart the system and customize it when the need arises.
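For instance, a state file can opt into a different renderer simply by declaring it on its first line. Here is a minimal sketch using the stock py renderer (the package name is arbitrary); it returns the same data structure the default Jinja/YAML renderer would produce:
#!py

def run():
    # Return the highstate data structure directly from Python
    return {
        'vim': {
            'pkg.installed': []
        }
    }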
Terminology
It's easy to be overwhelmed by the obscure vocabulary that Salt introduces, so here are the main Salt concepts that make it unique.

salt master - sends commands to minions
salt minions - receive commands from the master
execution modules - ad hoc commands
grains - static information about minions
pillar - secure user-defined variables stored on the master and assigned to minions (equivalent to data bags in Chef or Hiera in Puppet)
formulas (states) - representation of a system configuration; a grouping of one or more state files, possibly with pillar data and configuration files or anything else which defines a neat package for a particular application
mine - area on the master where results from minion-executed commands can be stored, such as the IP address of a backend webserver, which can then be used to configure a load balancer
top file - matches formulas and pillar data to minions
runners - modules executed on the master
returners - components that inject minion data into another system
renderers - components that run the template to produce the valid state of configuration files. The default renderer uses Jinja2 syntax and outputs YAML files.
reactor - component that triggers reactions on events
thorium - a new kind of reactor, which is still experimental
beacons - little pieces of code on the minion that listen for events such as server failure or file changes. When a beacon registers one of these events, it informs the master. Beacons combined with reactors are often used to do self-healing.
proxy minions - components that translate the Salt language into device-specific instructions in order to bring the device to the desired state using its API, or over SSH
salt cloud - command to bootstrap cloud nodes
salt ssh - command to run commands on systems without minions

You’ll find a great overview of all of this on the official docs.
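To make the formula/state concept concrete, here is a minimal example state file, say /srv/salt/nginx/init.sls (the package and service names are just an illustration), that installs nginx and keeps its service running:
nginx:
  pkg.installed: []
  service.running:
    - enable: True
    - require:
      - pkg: nginx
Once minions are connected (see the sections below), such a state can be applied with salt '*' state.apply nginx.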
Installation
Salt is built on top of lots of Python modules. Msgpack, YAML, Jinja2, MarkupSafe, ZeroMQ, Tornado, PyCrypto and M2Crypto are all required. To keep your system clean, easily upgradable and to avoid conflicts, the easiest installation workflow is to use system packages.
Installation of Salt is operating system specific; in the examples in this article, I'll be using Ubuntu 16.04 (Xenial Xerus); for other operating systems, consult the salt repo page.  For simplicity's sake, you can install the master and the minion on a single machine, and that's what we'll be doing here.  Later, we'll talk about how you can add additional minions.

To install the master and the minion, execute the following commands:
$ sudo su
# apt-get update
# apt-get upgrade
# apt-get install curl wget
# echo "deb [arch=amd64] http://apt.tcpcloud.eu/nightly xenial tcp-salt" > /etc/apt/sources.list
# wget -O - http://apt.tcpcloud.eu/public.gpg | sudo apt-key add -
# apt-get clean
# apt-get update
# apt-get install -y salt-master salt-minion reclass

Finally, create the  directory where you’ll store your state files.
# mkdir -p /srv/salt

You should now have Salt installed on your system, so check to see if everything looks good:
# salt --version
You should see a result something like this:
salt 2016.3.4 (Boron)

Alternative installations
If you can't find packages for your distribution, you can rely on Salt Bootstrap, which is an alternative installation method; see below for further details.
Configuration
To finish your configuration, you'll need to execute a few more steps:

If you have firewalls in the way, make sure you open up both port 4505 (the publish port) and 4506 (the return port) to the Salt master to let the minions talk to it.
Now you need to configure your minion to connect to your master.  Edit the file /etc/salt/minion.d/minion.conf and change the following lines as indicated below:

# Set the location of the salt master server. If the master server cannot be
# resolved, then the minion will fail to start.
master: localhost

# If multiple masters are specified in the 'master' setting, the default behavior
# is to always try to connect to them in the order they are listed. If random_master is
# set to True, the order will be randomized instead. This can be helpful in distributing

# Explicitly declare the id for this minion to use, if left commented the id
# will be the hostname as returned by the python call: socket.getfqdn()
# Since salt uses detached ids it is possible to run multiple minions on the
# same machine but with different ids, this can be useful for salt compute
# clusters.
id: saltstack-m01

# Append a domain to a hostname in the event that it does not exist.  This is
# useful for systems where socket.getfqdn() does not actually result in a
# FQDN (for instance, Solaris).
# append_domain:

As you can see, we're telling the minion where to find the master so it can connect; in this case, it's just localhost, but if that's not the case for you, you'll want to change it.  We've also given this particular minion an id of saltstack-m01; that's a completely arbitrary name, so you can use whatever you want.  Just make sure to substitute it in the examples!
Before you can play around, you'll need to restart the required Salt services to pick up the changes:
# service salt-minion restart
# service salt-master restart

Make sure services are also started at boot time:
# systemctl enable salt-master.service
# systemctl enable salt-minion.service

Before the master can do anything on the minion, the master needs to trust it, so accept the corresponding key of each of your minions as follows:
# salt-key
Accepted Keys:
Denied Keys:
Unaccepted Keys:
saltstack-m01
Rejected Keys:

Before accepting it, you can validate it looks good. First inspect it:
# salt-key -f saltstack-m01
Unaccepted Keys:
saltstack-m01:  98:f2:e1:9f:b2:b6:0e:fe:cb:70:cd:96:b0:37:51:d0

Then compare it with the minion key:
# salt-call --local key.finger
local:
98:f2:e1:9f:b2:b6:0e:fe:cb:70:cd:96:b0:37:51:d0

It looks the same, so go ahead and accept it:
# salt-key -a saltstack-m01

Repeat this process of installing salt-minion and accepting the keys to add new minions to your environment. Consult the documentation for more details regarding the configuration of minions, or, more generally, this documentation for all Salt configuration options.
Remote execution
Now that everything's installed and configured, let's make sure it's actually working. The first, most obvious thing we could do with our master/minion infrastructure is to run a command remotely. For example, we can test whether the minion is alive by using the test.ping command:
# salt 'saltstack-m01' test.ping
saltstack-m01:
   True
As you can see here, we're calling salt, feeding it a specific minion, and giving it a command to run on that minion.  We could, if we wanted to, send this command to more than one minion. For example, we could send it to all minions:
# salt '*' test.ping
saltstack-m01:
   True
In this case, we have only one, but if there were more, salt would cycle through all of them giving you the appropriate response.
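test.ping is just one execution module function; the same pattern works for any module, and you can target minions by grains as well as by ID. A few more examples (the package name is arbitrary):
# salt '*' cmd.run 'uptime'
# salt '*' pkg.install vim
# salt -G 'os:Ubuntu' test.ping
# salt 'saltstack-m01' grains.items
The first runs an arbitrary shell command everywhere, the second installs a package through each minion's native package manager, the third targets minions by the os grain, and the last dumps all grains for a single minion.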
So that should get you started. Next time, we'll look at some of the more complicated things you can do with Salt.
The post Introduction to Salt and SaltStack appeared first on Mirantis | Pure Play Open Cloud.

The first and final words on OpenStack availability zones

The post The first and final words on OpenStack availability zones appeared first on Mirantis | Pure Play Open Cloud.
Availability zones are one of the most frequently misunderstood and misused constructs in OpenStack. Each cloud operator has a different idea about what they are and how to use them. What's more, each OpenStack service implements availability zones differently, if it implements them at all.
Often, there isn’t even agreement over the basic meaning or purpose of availability zones.
On the one hand, you have the literal interpretation of the words “availability zone”, which would lead us to think of some logical subdivision of resources into failure domains, allowing cloud applications to intelligently deploy in ways to maximize their availability. (We’ll be running with this definition for the purposes of this article.)
On the other hand, the different ways that projects implement availability zones lend themselves to certain ways of using the feature. In other words, because this feature has been implemented in a flexible manner that does not tie us down to one specific concept of an availability zone, there's a lot of confusion over how to use it.
In this article, we'll look at the traditional definition of availability zones, insights into and best practices for planning and using them, and even a little bit about non-traditional uses. Finally, we hope to address the question: are availability zones right for you?
OpenStack availability zone Implementations
One of the things that complicates the use of availability zones is that each OpenStack project implements them in its own way (if at all). If you do plan to use availability zones, you should evaluate whether the OpenStack projects you're going to use support them, and how that affects your design and deployment of those services.
For the purposes of this article, we will look at three core services with respect to availability zones: Nova, Cinder, and Neutron. We won't go into the steps to set up availability zones, but instead we'll focus on a few of the key decision points, limitations, and trouble areas with them.
Nova availability zones
Since host aggregates were first introduced in OpenStack Grizzly, I have seen a lot of confusion about availability zones in Nova. Nova tied their availability zone implementation to host aggregates, and because the latter is a feature unique to the Nova project, its implementation of availability zones is also unique.
I have had many people tell me they use availability zones in Nova, convinced they are not using host aggregates. Well, I have news for these people: all* availability zones in Nova are host aggregates (though not all host aggregates are availability zones):
* Exceptions being the default_availability_zone that compute nodes are placed into when not in another user-defined availability zone, and the internal_service_availability_zone where other nova services live
Some of this confusion may come from the nova CLI. People may do a quick search online, see they can create an availability zone with one command, and may not realize that they’re actually creating a host aggregate. Ex:
$ nova aggregate-create <aggregate name> <AZ name>
$ nova aggregate-create HA1 AZ1
+----+---------+-------------------+-------+-------------------------+
| Id | Name    | Availability Zone | Hosts | Metadata                |
+----+---------+-------------------+-------+-------------------------+
| 4  | HA1     | AZ1               |       | 'availability_zone=AZ1' |
+----+---------+-------------------+-------+-------------------------+
I have seen people get confused with the second argument (the AZ name). This is just a shortcut for setting the availability_zone metadata for a new host aggregate you want to create.
This command is equivalent to creating a host aggregate, and then setting the metadata:
$ nova aggregate-create HA1
+----+---------+-------------------+-------+----------+
| Id | Name    | Availability Zone | Hosts | Metadata |
+----+---------+-------------------+-------+----------+
| 7  | HA1     | -                 |       |          |
+----+---------+-------------------+-------+----------+
$ nova aggregate-set-metadata HA1 availability_zone=AZ1
Metadata has been successfully updated for aggregate 7.
+----+---------+-------------------+-------+-------------------------+
| Id | Name    | Availability Zone | Hosts | Metadata                |
+----+---------+-------------------+-------+-------------------------+
| 7  | HA1     | AZ1               |       | 'availability_zone=AZ1' |
+----+---------+-------------------+-------+-------------------------+
Doing it this way, it's more apparent that the workflow is the same as for any other host aggregate; the only difference is the "magic" metadata key availability_zone, which we set to AZ1 (notice we also see AZ1 show up under the Availability Zone column). Now when we add compute nodes to this aggregate, they will be automatically transferred out of the default_availability_zone and into the one we have defined. For example:
Before:
$ nova availability-zone-list
| nova              | available                              |
| |- node-27        |                                        |
| | |- nova-compute | enabled :-) 2016-11-06T05:13:48.000000 |
+-------------------+----------------------------------------+
After:
$ nova availability-zone-list
| AZ1               | available                              |
| |- node-27        |                                        |
| | |- nova-compute | enabled :-) 2016-11-06T05:13:48.000000 |
+-------------------+----------------------------------------+
Note that there is one behavior that sets availability zone host aggregates apart from others. Namely, nova does not allow you to assign the same compute host to multiple aggregates with conflicting availability zone assignments. For example, we can first add a compute node to the previously created host aggregate with availability zone AZ1:
$ nova aggregate-add-host HA1 node-27
Host node-27 has been successfully added for aggregate 7
+----+------+-------------------+-----------+-------------------------+
| Id | Name | Availability Zone | Hosts     | Metadata                |
+----+------+-------------------+-----------+-------------------------+
| 7  | HA1  | AZ1               | 'node-27' | 'availability_zone=AZ1' |
+----+------+-------------------+-----------+-------------------------+
Next, we create a new host aggregate for availability zone AZ2:
$ nova aggregate-create HA2

+----+---------+-------------------+-------+----------+
| Id | Name    | Availability Zone | Hosts | Metadata |
+----+---------+-------------------+-------+----------+
| 13 | HA2     | -                 |       |          |
+----+---------+-------------------+-------+----------+

$ nova aggregate-set-metadata HA2 availability_zone=AZ2
Metadata has been successfully updated for aggregate 13.
+----+---------+-------------------+-------+-------------------------+
| Id | Name    | Availability Zone | Hosts | Metadata                |
+----+---------+-------------------+-------+-------------------------+
| 13 | HA2     | AZ2               |       | 'availability_zone=AZ2' |
+----+---------+-------------------+-------+-------------------------+
Now if we try to add the original compute node to this aggregate, we get an error because this aggregate has a conflicting availability zone:
$ nova aggregate-add-host HA2 node-27
ERROR (Conflict): Cannot add host node-27 in aggregate 13: host exists (HTTP 409)
+----+------+-------------------+----------+-------------------------+
| Id | Name | Availability Zone | Hosts    | Metadata                |
+----+------+-------------------+----------+-------------------------+
| 13 | HA2  | AZ2               |          | 'availability_zone=AZ2' |
+----+------+-------------------+----------+-------------------------+
(Incidentally, it is possible to have multiple host aggregates with the same availability_zone metadata, and add the same compute host to both. However, there are few, if any, good reasons for doing this.)
In contrast, Nova allows you to assign this compute node to another host aggregate with other metadata fields, as long as the availability_zone doesn't conflict.
You can see this if you first create a third host aggregate:
$ nova aggregate-create HA3
+----+---------+-------------------+-------+----------+
| Id | Name    | Availability Zone | Hosts | Metadata |
+----+---------+-------------------+-------+----------+
| 16 | HA3     | -                 |       |          |
+----+---------+-------------------+-------+----------+
Next, tag the host aggregate for some purpose not related to availability zones (for example, an aggregate to track compute nodes with SSDs):
$ nova aggregate-set-metadata HA3 ssd=True
Metadata has been successfully updated for aggregate 16.
+----+---------+-------------------+-------+------------+
| Id | Name    | Availability Zone | Hosts | Metadata   |
+----+---------+-------------------+-------+------------+
| 16 | HA3     | -                 |       | 'ssd=True' |
+----+---------+-------------------+-------+------------+
Adding the original node to another aggregate without conflicting availability zone metadata works:
$ nova aggregate-add-host HA3 node-27
Host node-27 has been successfully added for aggregate 16
+----+-------+-------------------+-----------+------------+
| Id | Name  | Availability Zone | Hosts     | Metadata   |
+----+-------+-------------------+-----------+------------+
| 16 | HA3   | -                 | 'node-27' | 'ssd=True' |
+----+-------+-------------------+-----------+------------+
(Incidentally, Nova will also happily let you assign the same compute node to another aggregate with ssd=False for metadata, even though that clearly doesn't make sense. Conflicts are only checked/enforced in the case of the availability_zone metadata.)
Nova configuration also holds parameters relevant to availability zone behavior. In the nova.conf read by your nova-api service, you can set a default availability zone for scheduling, which is used if users do not specify an availability zone in the API call:
[DEFAULT]
default_schedule_zone=AZ1
However, most operators leave this at its default setting (None), because it allows users who don’t care about availability zones to omit it from their API call, and the workload will be scheduled to any availability zone where there is available capacity.
If a user requests an invalid or undefined availability zone, the Nova API will report back with an HTTP 400 error. There is no availability zone fallback option.
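On the client side, users opt into an availability zone at boot time. For example (the flavor and image names are placeholders):
$ nova boot --flavor m1.small --image ubuntu-16.04 --availability-zone AZ1 myvm
If AZ1 does not exist, this request is rejected with the HTTP 400 error described above.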
Cinder
Creating availability zones in Cinder is accomplished by setting the following configuration parameter in cinder.conf, on the nodes where your cinder-volume service runs:
[DEFAULT]
storage_availability_zone=AZ1
Note that you can only set the availability zone to one value. This is consistent with availability zones in other OpenStack projects that do not allow for the notion of overlapping failure domains or multiple failure domain levels or tiers.
The change takes effect when you restart your cinder-volume services. You can confirm your availability zone assignments as follows:
cinder service-list
+---------------+-------------------+------+---------+-------+
|     Binary    |        Host       | Zone | Status  | State |
+---------------+-------------------+------+---------+-------+
| cinder-volume | hostname1@LVM     |  AZ1 | enabled |   up  |
| cinder-volume | hostname2@LVM     |  AZ2 | enabled |   up  |
If you would like to establish a default availability zone, you can set this parameter in cinder.conf on the cinder-api nodes:
[DEFAULT]
default_availability_zone=AZ1
This instructs Cinder which availability zone to use if the API call did not specify one. If you don't set it, Cinder will use a hardcoded default, nova. In the case of our example, where we've set the default availability zone in Nova to AZ1, this would result in a failure. This also means that, unlike Nova, users do not have the flexibility of omitting availability zone information and expecting that Cinder will select any available backend with spare capacity in any availability zone.
Therefore, you have a choice with this parameter. You can set it to one of your availability zones so API calls without availability zone information don't fail, but this can cause uneven storage allocation across your availability zones. Or, you can leave this parameter unset, and accept that user API calls that forget or omit availability zone info will fail.
Another option is to set the default to a non-existent availability zone such as you-must-specify-an-AZ or something similar, so when the call fails due to the non-existent availability zone, this information will be included in the error message sent back to the client.
Your storage backends, storage drivers, and storage architecture may also affect how you set up your availability zones. If we are using the reference Cinder LVM ISCSI Driver deployed on commodity hardware, and that hardware fits the same availability zone criteria of our computes, then we could setup availability zones to match what we have defined in Nova. We could also do the same if we had a third party storage appliance in each availability zone, e.g.:
|     Binary    |           Host          | Zone | Status  | State |
| cinder-volume | hostname1@StorageArray1 |  AZ1 | enabled |   up  |
| cinder-volume | hostname2@StorageArray2 |  AZ2 | enabled |   up  |
(Note: Notice that the hostnames (hostname1 and hostname2) are still different in this example. The cinder multi-backend feature allows us to configure multiple storage backends in the same cinder.conf (for the same cinder-volume service), but Cinder availability zones can only be defined per cinder-volume service, and not per-backend per-cinder-volume service. In other words, if you define multiple backends in one cinder.conf, they will all inherit the same availability zone.)
However, in many cases if you're using a third-party storage appliance, these systems usually have their own built-in redundancy that exists outside of OpenStack notions of availability zones. Similarly, if you use a distributed storage solution like Ceph, then availability zones have little or no meaning in this context. In these cases, you can forgo Cinder availability zones.
The one issue in doing this, however, is that any availability zones you defined for Nova won't match. This can cause problems when Nova makes API calls to Cinder, for example, when performing a Boot from Volume API call through Nova. If Nova decides to provision your VM in AZ1, it will tell Cinder to provision a boot volume in AZ1, but Cinder doesn't know anything about AZ1, so the API call will fail. To prevent this from happening, we need to set the following parameter in cinder.conf on your nodes running cinder-api:
[DEFAULT]
allow_availability_zone_fallback=True
This parameter prevents the API call from failing, because if the requested availability zone does not exist, Cinder will fallback to another availability zone (whichever you defined in default_availability_zone parameter, or in storage_availability_zone if the default is not set). The hardcoded default storage_availability_zone is nova, so the fallback availability zone should match the default availability zone for your cinder-volume services, and everything should work.
The easiest way to solve the problem, however, is to remove the AvailabilityZoneFilter from your filter list in cinder.conf on nodes running cinder-scheduler. This makes the scheduler ignore any availability zone information passed to it altogether, which may also be helpful in case of any availability zone configuration mismatch.
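For example, the scheduler section of cinder.conf might end up looking something like this (a sketch; the stock filter list includes AvailabilityZoneFilter, so you simply omit it):
[DEFAULT]
scheduler_default_filters = CapacityFilter,CapabilitiesFilter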
Neutron
Availability zone support was added to Neutron in the Mitaka release. Availability zones can be set for DHCP and L3 agents in their respective configuration files:
[AGENT]
availability_zone = AZ1
Restart the agents, and confirm availability zone settings as follows:
neutron agent-show <agent-id>
+---------------------+------------+
| Field               | Value      |
+---------------------+------------+
| availability_zone   | AZ1        |

If you would like to establish default availability zones, you can set this parameter in neutron.conf on neutron-server nodes:
[DEFAULT]
default_availability_zones=AZ1,AZ2
This parameter tells Neutron which availability zones to use if the API call did not specify any. Unlike Cinder, you can specify multiple availability zones, and leaving it undefined places no constraints on scheduling, as there are no hardcoded defaults. If you have users making API calls that do not care about the availability zone, then you can enumerate all your availability zones for this parameter, or simply leave it undefined; both would yield the same result.
Additionally, when users do specify an availability zone, such requests are fulfilled on a "best effort" basis in Neutron. In other words, there is no need for an availability zone fallback parameter, because your API call will still execute even if your availability zone hint can't be satisfied.
Another important distinction that sets Neutron aside from Nova and Cinder is that it implements availability zones as scheduler hints, meaning that on the client side you can repeat this option to chain together multiple availability zone specifications in the event that more than one availability zone would satisfy your availability criteria. For example:
$ neutron net-create --availability-zone-hint AZ1 --availability-zone-hint AZ2 new_network
As with Cinder, the Neutron plugins and backends you’re using deserve attention, as the support or need for availability zones may be different depending on their implementation. For example, if you’re using a reference Neutron deployment with the ML2 plugin and with DHCP and L3 agents deployed to commodity hardware, you can likely place these agents consistently according to the same availability zone criteria used for your computes.
Whereas in contrast, other alternatives such as the Contrail plugin for Neutron do not support availability zones. Or if you are using Neutron DVR for example, then availability zones have limited significance for Layer 3 Neutron.
OpenStack Project availability zone Comparison Summary
Before we move on, it&8217;s helpful to review how each project handles availability zones.

Nova
Cinder
Neutron

Default availability zone scheduling
Can set to one availability zone or None
Can set one availability zone; cannot set None
Can set to any list of availability zones or none

Availability zone fallback
None supported
Supported through configuration
N/A; scheduling to availability zones done on a best effort basis

Availability zone definition restrictions
No more than availability zone per nova-compute
No more than 1 availability zone per cinder-volume
No more than 1 availability zone per neutron agent

Availability zone client restrictions
Can specify one availability zone or none
Can specify one availability zone or none
Can specify an arbitrary number of availability zones

Availability zones typically used when you have &;
Commodity HW for computes, libvirt driver
Commodity HW for storage, LVM iSCSI driver
Commodity HW for neutron agents, ML2 plugin

Availability zones not typically used when you have&8230;
Third party hypervisor drivers that manage their own HA for VMs (DRS for VCenter)
Third party drivers, backends, etc. that manage their own HA
Third party plugins, backends, etc. that manage their own HA

Best Practices for availability zones
Now let&8217;s talk about how to best make use of availability zones.
What should my availability zones represent?
The first thing you should do as a cloud operator is to nail down your own accepted definition of an availability zone and how you will use them, and remain consistent. You don’t want to end up in a situation where availability zones are taking on more than one meaning in the same cloud. For example:
Fred’s AZ            | Example of AZ used to perform tenant workload isolation
VMWare cluster 1 AZ | Example of AZ used to select a specific hypervisor type
Power source 1 AZ   | Example of AZ used to select a specific failure domain
Rack 1 AZ           | Example of AZ used to select a specific failure domain
Such a set of definitions would be a source of inconsistency and confusion in your cloud. It’s usually better to keep things simple with one availability zone definition, and use OpenStack features such as Nova Flavors or Nova/Cinder boot hints to achieve other requirements for multi-tenancy isolation, ability to select between different hypervisor options and other features, and so on.
Note that OpenStack currently does not support the concept of multiple failure domain levels/tiers. Even though we may have multiple ways to define failure domains (e.g., by power circuit, rack, room, etc), we must pick a single convention.
For the purposes of this article, we will discuss availability zones in the context of failure domains. However, we will cover one other use for availability zones in the third section.
How many availability zones do I need?
One question that customers frequently get hung up on is how many availability zones they should create. This can be tricky because the setup and management of availability zones involves stakeholders at every layer of the solution stack, from tenant applications to cloud operators, down to data center design and planning.

A good place to start is your cloud application requirements: How many failure domains are they designed to work with (i.e. redundancy factor)? The likely answer is two (primary + backup), three (for example, for a database or other quorum-based system), or one (for example, a legacy app with single points of failure). Therefore, the vast majority of clouds will have either 2, 3, or 1 availability zone.
Also keep in mind that as a general design principle, you want to minimize the number of availability zones in your environment, because the side effect of availability zone proliferation is that you are dividing your capacity into more resource islands. The resource utilization in each island may not be equal, and now you have an operational burden to track and maintain capacity in each island/availability zone. Also, if you have a lot of availability zones (more than the redundancy factor of tenant applications), tenants are left to guess which availability zones to use and which have available capacity.
How do I organize my availability zones?
The value proposition of availability zones is that tenants are able to achieve a higher level of availability in their applications. In order to make good on that proposition, we need to design our availability zones in ways that mitigate single points of failure.             
For example, if our resources are split between two power sources in the data center, then we may decide to define two resource pools (availability zones) according to their connected power source:
Or, if we only have one TOR switch in our racks, then we may decide to define availability zones by rack. However, here we can run into problems if we make each rack its own availability zone, as this will not scale from a capacity management perspective for more than 2-3 racks/availability zones (because of the &;resource island&; problem). In this case, you might consider dividing/arranging your total rack count into into 2 or 3 logical groupings that correlate to your defined availability zones:
We may also find situations where we have redundant TOR switch pairs in our racks, power source diversity to each rack, and lack a single point of failure. You could still place racks into availability zones as in the previous example, but the value of availability zones is marginalized, since you need to have a double failure (e.g., both TORs in the rack failing) to take down a rack.
Ultimately with any of the above scenarios, the value added by your availability zones will depend on the likelihood and expression of failure modes &8211; both the ones you have designed for, and the ones you have not. But getting accurate and complete information on such failure modes may not be easy, so the value of availability zones from this kind of unplanned failure can be difficult to pin down.
There is however another area where availability zones have the potential to provide value &8212; planned maintenance. Have you ever needed to move, recable, upgrade, or rebuild a rack? Ever needed to shut off power to part of the data center to do electrical work? Ever needed to apply disruptive updates to your hypervisors, like kernel or QEMU security updates? How about upgrades of OpenStack or the hypervisor operating system?
Chances are good that these kinds of planned maintenance are a higher source of downtime than unplanned hardware failures that happen out of the blue. Therefore, this type of planned maintenance can also impact how you define your availability zones. In general the grouping of racks into availability zones (described previously) still works well, and is the most common availability zone paradigm we see for our customers.
However, it could affect the way in which you group your racks into availability zones. For example, you may choose to create availability zones based on physical parameters like which floor, room or building the equipment is located in, or other practical considerations that would help in the event you need to vacate or rebuild certain space in your DC(s). Ex:
One of the limitations of the OpenStack implementation of availability zones made apparent in this example is that you are forced to choose one definition. Applications can request a specific availability zone, but are not offered the flexibility of requesting building level isolation, vs floor, room, or rack level isolation. This will be a fixed, inherited property of the availability zones you create. If you need more, then you start to enter the realm of other OpenStack abstractions like Regions and Cells.
Other uses for availability zones?
Another way in which people have found to use availability zones is in multi-hypervisor environments. In the ideal world of the “implementation-agnostic” cloud, we would abstract such underlying details from users of our platform. However there are some key differences between hypervisors that make this aspiration difficult. Take the example of KVM & VMWare.
When an iSCSI target is provisioned through with the LVM iSCSI Cinder driver, it cannot be attached to or consumed by ESXi nodes. The provision request must go through VMWare’s VMDK Cinder driver, which proxies the creation and attachment requests to vCenter. This incompatibility can cause errors and thus a negative user experience issues if the user tries to attach a volume to their hypervisor provisioned from the wrong backend.
To solve this problem, some operators use availability zones as a way for users to select hypervisor types (for example, AZ_KVM1, AZ_VMWARE1), and set the following configuration in their nova.conf:
[cinder]
cross_az_attach = False
This presents users with an error if they attempt to attach a volume from one availability zone (e.g., AZ_VMWARE1) to another availability zone (e.g., AZ_KVM1). The call would have certainly failed regardless, but with a different error from farther downstream from one of the nova-compute agents, instead of from nova-api. This way, it&8217;s easier for the user to see where they went wrong and correct the problem.
In this case, the availability zone also acts as a tag to remind users which hypervisor their VM resides on.
In my opinion, this is stretching the definition of availability zones as failure domains. VMWare may be considered its own failure domain separate from KVM, but that’s not the primary purpose of creating availability zones this way. The primary purpose is to differentiate hypervisor types. Different hypervisor types have a different set of features and capabilities. If we think about the problem in these terms, there are a number of other solutions that come to mind:

Nova Flavors: Define a “VMWare” set of flavors that map to your VCenter cluster(s). If your tenants that use VMWare are different than your tenants who use KVM, you can make them private flavors, ensuring that tenants only ever see or interact with the hypervisor type they expect.
Cinder: Similarly for Cinder, you can make the VMWare backend private to specific tenants, and/or set quotas differently for that backend, to ensure that tenants will only allocate from the correct storage pools for their hypervisor type.
Image metadata: You can tag your images according to the hypervisor they run on. Set image property hypervisor_type to vmware for VMDK images, and to qemu for other images. The ComputeCapabilitiesFilter in Nova will honor the hypervisor placement request.

Soo… Are availability zones right for me?
So wrapping up, think of availability zones in terms of:

Unplanned failures: If you have a good history of failure data, well understood failure modes, or some known single point of failure that availability zones can help mitigate, then availability zones may be a good fit for your environment.
Planned maintenance: If you have well understood maintenance processes that have awareness of your availability zone definitions and can take advantage of them, then availability zones may be a good fit for your environment. This is where availability zones can provide some of the creates value, but is also the most difficult to achieve, as it requires intelligent availability zone-aware rolling updates and upgrades, and affects how data center personnel perform maintenance activities.
Tenant application design/support: If your tenants are running legacy apps, apps with single points of failure, or do not support use of availability zones in their provisioning process, then availability zones will be of no use for these workloads.
Other alternatives for achieving app availability: Workloads built for geo-redundancy can achieve the same level of HA (or better) in a multi-region cloud. If this were the case for your cloud and your cloud workloads, availability zones would be unnecessary.
OpenStack projects: You should factor into your decision the limitations and differences in availability zone implementations between different OpenStack projects, and perform this analysis for all OpenStack projects in your scope of deployment.

The post The first and final words on OpenStack availability zones appeared first on Mirantis | Pure Play Open Cloud.
Quelle: Mirantis

The first and final words on OpenStack availability zones

The post The first and final words on OpenStack availability zones appeared first on Mirantis | Pure Play Open Cloud.
Availability zones are one of the most frequently misunderstood and misused constructs in OpenStack. Each cloud operator has a different idea about what they are and how to use them. What's more, each OpenStack service implements availability zones differently, if it even implements them at all.
Often, there isn’t even agreement over the basic meaning or purpose of availability zones.
On the one hand, you have the literal interpretation of the words “availability zone”, which would lead us to think of some logical subdivision of resources into failure domains, allowing cloud applications to intelligently deploy in ways to maximize their availability. (We’ll be running with this definition for the purposes of this article.)
On the other hand, the different ways that projects implement availability zones encourage different ways of using the feature. In other words, because availability zones have been implemented flexibly, without being tied to one specific concept, there's a lot of confusion over how to use them.
In this article, we'll look at the traditional definition of availability zones, insights into and best practices for planning and using them, and even a little bit about non-traditional uses. Finally, we hope to address the question: Are availability zones right for you?
OpenStack availability zone Implementations
One of the things that complicates the use of availability zones is that each OpenStack project implements them in its own way (if at all). If you do plan to use availability zones, you should evaluate whether the OpenStack projects you're going to use support them, and how that affects your design and deployment of those services.
For the purposes of this article, we will look at three core services with respect to availability zones: Nova, Cinder, and Neutron. We won't go into the steps to set up availability zones, but instead we'll focus on a few of the key decision points, limitations, and trouble areas with them.
Nova availability zones
Since host aggregates were first introduced in OpenStack Grizzly, I have seen a lot of confusion about availability zones in Nova. Nova tied their availability zone implementation to host aggregates, and because the latter is a feature unique to the Nova project, its implementation of availability zones is also unique.
I have had many people tell me they use availability zones in Nova, convinced they are not using host aggregates. Well, I have news for these people: all* availability zones in Nova are host aggregates (though not all host aggregates are availability zones):
* Exceptions being the default_availability_zone that compute nodes are placed into when not in another user-defined availability zone, and the internal_service_availability_zone where other nova services live
Some of this confusion may come from the nova CLI. People may do a quick search online, see they can create an availability zone with one command, and may not realize that they’re actually creating a host aggregate. Ex:
$ nova aggregate-create <aggregate name> <AZ name>
$ nova aggregate-create HA1 AZ1
+----+---------+-------------------+-------+-------------------------+
| Id | Name    | Availability Zone | Hosts | Metadata                |
+----+---------+-------------------+-------+-------------------------+
| 4  |   HA1   | AZ1               |       | 'availability_zone=AZ1' |
+----+---------+-------------------+-------+-------------------------+
I have seen people get confused with the second argument (the AZ name). This is just a shortcut for setting the availability_zone metadata for a new host aggregate you want to create.
This command is equivalent to creating a host aggregate, and then setting the metadata:
$ nova aggregate-create HA1
+----+---------+-------------------+-------+----------+
| Id | Name    | Availability Zone | Hosts | Metadata |
+----+---------+-------------------+-------+----------+
| 7  |   HA1   | -                 |       |          |
+----+---------+-------------------+-------+----------+
$ nova aggregate-set-metadata HA1 availability_zone=AZ1
Metadata has been successfully updated for aggregate 7.
+----+---------+-------------------+-------+-------------------------+
| Id | Name    | Availability Zone | Hosts | Metadata                |
+----+---------+-------------------+-------+-------------------------+
| 7  |   HA1   | AZ1               |       | 'availability_zone=AZ1' |
+----+---------+-------------------+-------+-------------------------+
Doing it this way, it's more apparent that the workflow is the same as for any other host aggregate; the only difference is the "magic" metadata key availability_zone, which we set to AZ1 (notice that AZ1 also shows up under the Availability Zone column). And now when we add compute nodes to this aggregate, they will be automatically transferred out of the default_availability_zone and into the one we have defined. For example:
Before:
$ nova availability-zone-list
| nova              | available                              |
| |- node-27        |                                        |
| | |- nova-compute | enabled :-) 2016-11-06T05:13:48.000000 |
+-------------------+----------------------------------------+
After:
$ nova availability-zone-list
| AZ1               | available                              |
| |- node-27        |                                        |
| | |- nova-compute | enabled :-) 2016-11-06T05:13:48.000000 |
+-------------------+----------------------------------------+
Note that there is one behavior that sets availability zone host aggregates apart from others: Nova does not allow you to assign the same compute host to multiple aggregates with conflicting availability zone assignments. For example, we can first add a compute node to the previously created host aggregate with availability zone AZ1:
$ nova aggregate-add-host HA1 node-27
Host node-27 has been successfully added for aggregate 7
+----+------+-------------------+------------+-------------------------+
| Id | Name | Availability Zone | Hosts      | Metadata                |
+----+------+-------------------+------------+-------------------------+
| 7  | HA1  | AZ1               | 'node-27'  | 'availability_zone=AZ1' |
+----+------+-------------------+------------+-------------------------+
Next, we create a new host aggregate for availability zone AZ2:
$ nova aggregate-create HA2

+----+---------+-------------------+-------+----------+
| Id | Name    | Availability Zone | Hosts | Metadata |
+----+---------+-------------------+-------+----------+
| 13 |   HA2   | -                 |       |          |
+----+---------+-------------------+-------+----------+

$ nova aggregate-set-metadata HA2 availability_zone=AZ2
Metadata has been successfully updated for aggregate 13.
+----+---------+-------------------+-------+-------------------------+
| Id | Name    | Availability Zone | Hosts | Metadata                |
+----+---------+-------------------+-------+-------------------------+
| 13 |   HA2   | AZ2               |       | 'availability_zone=AZ2' |
+----+---------+-------------------+-------+-------------------------+
Now if we try to add the original compute node to this aggregate, we get an error because this aggregate has a conflicting availability zone:
$ nova aggregate-add-host HA2 node-27
ERROR (Conflict): Cannot add host node-27 in aggregate 13: host exists (HTTP 409)
+----+------+-------------------+-------+-------------------------+
| Id | Name | Availability Zone | Hosts | Metadata                |
+----+------+-------------------+-------+-------------------------+
| 13 | HA2  | AZ2               |       | 'availability_zone=AZ2' |
+----+------+-------------------+-------+-------------------------+
(Incidentally, it is possible to have multiple host aggregates with the same availability_zone metadata, and add the same compute host to both. However, there are few, if any, good reasons for doing this.)
In contrast, Nova allows you to assign this compute node to another host aggregate with other metadata fields, as long as the availability_zone doesn't conflict. You can see this if you first create a third host aggregate:
$ nova aggregate-create HA3
+----+---------+-------------------+-------+----------+
| Id | Name    | Availability Zone | Hosts | Metadata |
+----+---------+-------------------+-------+----------+
| 16 |   HA3   | -                 |       |          |
+----+---------+-------------------+-------+----------+
Next, tag the host aggregate for some purpose not related to availability zones (for example, an aggregate to track compute nodes with SSDs):
$ nova aggregate-set-metadata HA3 ssd=True
Metadata has been successfully updated for aggregate 16.
+----+---------+-------------------+-------+------------+
| Id | Name    | Availability Zone | Hosts | Metadata   |
+----+---------+-------------------+-------+------------+
| 16 |   HA3   | -                 |       | 'ssd=True' |
+----+---------+-------------------+-------+------------+
Adding the original node to another aggregate without conflicting availability zone metadata works:
$ nova aggregate-add-host HA3 node-27
Host node-27 has been successfully added for aggregate 16
+----+------+-------------------+------------+------------+
| Id | Name | Availability Zone | Hosts      | Metadata   |
+----+------+-------------------+------------+------------+
| 16 | HA3  | -                 | 'node-27'  | 'ssd=True' |
+----+------+-------------------+------------+------------+
(Incidentally, Nova will also happily let you assign the same compute node to another aggregate with ssd=False for metadata, even though that clearly doesn't make sense. Conflicts are only checked/enforced in the case of the availability_zone metadata.)
Nova configuration also holds parameters relevant to availability zone behavior. In the nova.conf read by your nova-api service, you can set a default availability zone for scheduling, which is used if users do not specify an availability zone in the API call:
[DEFAULT]
default_schedule_zone=AZ1
However, most operators leave this at its default setting (None), because it allows users who don’t care about availability zones to omit it from their API call, and the workload will be scheduled to any availability zone where there is available capacity.
If a user requests an invalid or undefined availability zone, the Nova API will report back with an HTTP 400 error. There is no availability zone fallback option.
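For completeness, here is what it looks like when a user does pin an instance to a zone at boot time; the flavor, image, and instance names below are just placeholders:
$ nova boot --flavor m1.small --image cirros --availability-zone AZ1 my-instance
If the requested zone doesn't exist, this is the call that comes back with the HTTP 400 described above.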
Cinder
Creating availability zones in Cinder is accomplished by setting the following configuration parameter in cinder.conf, on the nodes where your cinder-volume service runs:
[DEFAULT]
storage_availability_zone=AZ1
Note that you can only set the availability zone to one value. This is consistent with availability zones in other OpenStack projects that do not allow for the notion of overlapping failure domains or multiple failure domain levels or tiers.
The change takes effect when you restart your cinder-volume services. You can confirm your availability zone assignments as follows:
cinder service-list
+---------------+-------------------+------+---------+-------+
|     Binary    |        Host       | Zone | Status  | State |
+---------------+-------------------+------+---------+-------+
| cinder-volume | hostname1@LVM     |  AZ1 | enabled |   up  |
| cinder-volume | hostname2@LVM     |  AZ2 | enabled |   up  |
If you would like to establish a default availability zone, you can set this parameter in cinder.conf on the cinder-api nodes:
[DEFAULT]
default_availability_zone=AZ1
This tells Cinder which availability zone to use if the API call did not specify one. If you don't set it, Cinder uses a hardcoded default, nova. In our example, where we've set the default availability zone in Nova to AZ1, that mismatch would result in a failure. This also means that, unlike Nova, users do not have the flexibility of omitting availability zone information and expecting Cinder to select any available backend with spare capacity in any availability zone.
Therefore, you have a choice with this parameter. You can set it to one of your availability zones so that API calls without availability zone information don't fail, at the risk of uneven storage allocation across your availability zones. Or, you can leave this parameter unset and accept that user API calls that omit availability zone info will fail.
Another option is to set the default to a non-existent availability zone, such as you-must-specify-an-AZ, so that when the call fails because the zone doesn't exist, that hint is included in the error message sent back to the client.
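As a minimal sketch of that last approach (the zone name itself is arbitrary), cinder.conf on the cinder-api nodes would simply carry:
[DEFAULT]
default_availability_zone=you-must-specify-an-AZ
Any volume creation call that omits an availability zone then fails with that self-explanatory zone name in the error.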
Your storage backends, storage drivers, and storage architecture may also affect how you set up your availability zones. If we are using the reference Cinder LVM iSCSI driver deployed on commodity hardware, and that hardware fits the same availability zone criteria as our computes, then we can set up availability zones to match what we have defined in Nova. We could also do the same if we had a third party storage appliance in each availability zone, e.g.:
|     Binary    |           Host          | Zone | Status  | State |
| cinder-volume | hostname1@StorageArray1 |  AZ1 | enabled |   up  |
| cinder-volume | hostname2@StorageArray2 |  AZ2 | enabled |   up  |
(Note that the hostnames (hostname1 and hostname2) are still different in this example. The Cinder multi-backend feature allows us to configure multiple storage backends in the same cinder.conf (for the same cinder-volume service), but Cinder availability zones can only be defined per cinder-volume service, not per backend. In other words, if you define multiple backends in one cinder.conf, they will all inherit the same availability zone.)
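To make that note concrete, here is a rough sketch of such a cinder.conf; the backend names, volume groups, and driver choice are illustrative only:
[DEFAULT]
# Both backends inherit AZ1; the availability zone is per cinder-volume service
storage_availability_zone=AZ1
enabled_backends=lvm-1,lvm-2

[lvm-1]
volume_driver=cinder.volume.drivers.lvm.LVMVolumeDriver
volume_backend_name=LVM_1
volume_group=cinder-volumes-1

[lvm-2]
volume_driver=cinder.volume.drivers.lvm.LVMVolumeDriver
volume_backend_name=LVM_2
volume_group=cinder-volumes-2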
However, in many cases if you’re using a third party storage appliance, then these systems usually have their own built-in redundancy that exist outside of OpenStack notions of availability zones. Similarly if you use a distributed storage solution like Ceph, then availability zones have little or no meaning in this context. In this case, you can forgo Cinder availability zones.
The one issue in doing this, however, is that any availability zones you defined for Nova won't match. This can cause problems when Nova makes API calls to Cinder; for example, when performing a Boot from Volume API call through Nova. If Nova decides to provision your VM in AZ1, it will tell Cinder to provision a boot volume in AZ1, but Cinder doesn't know anything about AZ1, so the API call will fail. To prevent this from happening, we need to set the following parameter in cinder.conf on your nodes running cinder-api:
[DEFAULT]
allow_availability_zone_fallback=True
This parameter prevents the API call from failing, because if the requested availability zone does not exist, Cinder will fall back to another availability zone (whichever you defined in the default_availability_zone parameter, or in storage_availability_zone if the default is not set). The hardcoded default for storage_availability_zone is nova, so the fallback availability zone should match the default availability zone for your cinder-volume services, and everything should work.
The easiest way to solve the problem, however, is to remove the AvailabilityZoneFilter from your filter list in cinder.conf on nodes running cinder-scheduler. This makes the scheduler ignore any availability zone information passed to it altogether, which may also be helpful in the case of any availability zone configuration mismatch.
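As a sketch, assuming the stock Cinder filter list (AvailabilityZoneFilter, CapacityFilter, CapabilitiesFilter), cinder.conf on the cinder-scheduler nodes would then carry:
[DEFAULT]
# AvailabilityZoneFilter intentionally left out of the default filter list
scheduler_default_filters=CapacityFilter,CapabilitiesFilter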
Neutron
Availability zone support was added to Neutron in the Mitaka release. Availability zones can be set for DHCP and L3 agents in their respective configuration files:
[AGENT]
availability_zone = AZ1
Restart the agents, and confirm availability zone settings as follows:
neutron agent-show <agent-id>
+---------------------+------------+
| Field               | Value      |
+---------------------+------------+
| availability_zone   | AZ1        |

If you would like to establish a default availability zone, you can set this parameter in neutron.conf on the neutron-server nodes:
[DEFAULT]
default_availability_zones=AZ1,AZ2
This parameter tells Neutron which availability zones to use if the API call did not specify any. Unlike Cinder, you can specify multiple availability zones, and leaving it undefined places no constraints on scheduling, as there are no hardcoded defaults. If you have users making API calls that do not care about the availability zone, then you can either enumerate all your availability zones for this parameter or simply leave it undefined; both would yield the same result.
Additionally, when users do specify an availability zone, such requests are fulfilled as a "best effort" in Neutron. In other words, there is no need for an availability zone fallback parameter, because your API call will still execute even if your availability zone hint can't be satisfied.
Another important distinction that sets Neutron apart from Nova and Cinder is that it implements availability zones as scheduler hints, meaning that on the client side you can repeat this option to chain together multiple availability zone specifications in the event that more than one availability zone would satisfy your availability criteria. For example:
$ neutron net-create --availability-zone-hint AZ1 --availability-zone-hint AZ2 new_network
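The same hint can be passed when creating routers, so that L3 scheduling also respects your zones; a minimal sketch (the router name is arbitrary):
$ neutron router-create --availability-zone-hint AZ1 --availability-zone-hint AZ2 new_router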
As with Cinder, the Neutron plugins and backends you’re using deserve attention, as the support or need for availability zones may be different depending on their implementation. For example, if you’re using a reference Neutron deployment with the ML2 plugin and with DHCP and L3 agents deployed to commodity hardware, you can likely place these agents consistently according to the same availability zone criteria used for your computes.
In contrast, other alternatives, such as the Contrail plugin for Neutron, do not support availability zones. And if you are using Neutron DVR, for example, then availability zones have limited significance for Layer 3 Neutron.
OpenStack Project availability zone Comparison Summary
Before we move on, it's helpful to review how each project handles availability zones.

Default availability zone scheduling
  Nova: Can set to one availability zone or None
  Cinder: Can set to one availability zone; cannot set None
  Neutron: Can set to any list of availability zones or none

Availability zone fallback
  Nova: Not supported
  Cinder: Supported through configuration
  Neutron: N/A; scheduling to availability zones is done on a best-effort basis

Availability zone definition restrictions
  Nova: No more than 1 availability zone per nova-compute
  Cinder: No more than 1 availability zone per cinder-volume
  Neutron: No more than 1 availability zone per neutron agent

Availability zone client restrictions
  Nova: Can specify one availability zone or none
  Cinder: Can specify one availability zone or none
  Neutron: Can specify an arbitrary number of availability zones

Availability zones typically used when you have…
  Nova: Commodity HW for computes, libvirt driver
  Cinder: Commodity HW for storage, LVM iSCSI driver
  Neutron: Commodity HW for neutron agents, ML2 plugin

Availability zones not typically used when you have…
  Nova: Third party hypervisor drivers that manage their own HA for VMs (e.g., DRS for vCenter)
  Cinder: Third party drivers, backends, etc. that manage their own HA
  Neutron: Third party plugins, backends, etc. that manage their own HA

Best Practices for availability zones
Now let's talk about how to best make use of availability zones.
What should my availability zones represent?
The first thing you should do as a cloud operator is to nail down your own accepted definition of an availability zone and how you will use them, and remain consistent. You don’t want to end up in a situation where availability zones are taking on more than one meaning in the same cloud. For example:
Fred’s AZ            | Example of AZ used to perform tenant workload isolation
VMWare cluster 1 AZ | Example of AZ used to select a specific hypervisor type
Power source 1 AZ   | Example of AZ used to select a specific failure domain
Rack 1 AZ           | Example of AZ used to select a specific failure domain
Such a set of definitions would be a source of inconsistency and confusion in your cloud. It's usually better to keep things simple with one availability zone definition, and use OpenStack features such as Nova Flavors or Nova/Cinder boot hints to achieve other requirements, such as multi-tenancy isolation, the ability to select between different hypervisor options, and so on.
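As a hedged sketch of the flavor-based approach, assuming the AggregateInstanceExtraSpecsFilter is enabled in the Nova scheduler and reusing the HA3 aggregate (tagged ssd=True) from the Nova section above (the flavor name and sizing are made up):
$ nova flavor-create m1.ssd auto 4096 40 2
$ nova flavor-key m1.ssd set aggregate_instance_extra_specs:ssd=True
With that filter enabled, instances booted with m1.ssd land only on hosts in aggregates tagged ssd=True, without exposing an availability zone to users at all.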
Note that OpenStack currently does not support the concept of multiple failure domain levels/tiers. Even though we may have multiple ways to define failure domains (e.g., by power circuit, rack, room, etc), we must pick a single convention.
For the purposes of this article, we will discuss availability zones in the context of failure domains. However, we will cover one other use for availability zones in the third section.
How many availability zones do I need?
One question that customers frequently get hung up on is how many availability zones they should create. This can be tricky because the setup and management of availability zones involves stakeholders at every layer of the solution stack, from tenant applications to cloud operators, down to data center design and planning.

A good place to start is your cloud application requirements: How many failure domains are they designed to work with (i.e. redundancy factor)? The likely answer is two (primary + backup), three (for example, for a database or other quorum-based system), or one (for example, a legacy app with single points of failure). Therefore, the vast majority of clouds will have either 2, 3, or 1 availability zone.
Also keep in mind that as a general design principle, you want to minimize the number of availability zones in your environment, because the side effect of availability zone proliferation is that you are dividing your capacity into more resource islands. The resource utilization in each island may not be equal, and now you have an operational burden to track and maintain capacity in each island/availability zone. Also, if you have a lot of availability zones (more than the redundancy factor of tenant applications), tenants are left to guess which availability zones to use and which have available capacity.
How do I organize my availability zones?
The value proposition of availability zones is that tenants are able to achieve a higher level of availability in their applications. In order to make good on that proposition, we need to design our availability zones in ways that mitigate single points of failure.             
For example, if our resources are split between two power sources in the data center, then we may decide to define two resource pools (availability zones) according to their connected power source:
Or, if we only have one TOR switch in our racks, then we may decide to define availability zones by rack. However, here we can run into problems if we make each rack its own availability zone, as this will not scale from a capacity management perspective for more than 2-3 racks/availability zones (because of the "resource island" problem). In this case, you might consider dividing/arranging your total rack count into 2 or 3 logical groupings that correlate to your defined availability zones:
We may also find situations where we have redundant TOR switch pairs in our racks and power source diversity to each rack, with no single point of failure. You could still place racks into availability zones as in the previous example, but the value of availability zones is marginalized, since you need a double failure (e.g., both TORs in the rack failing) to take down a rack.
Ultimately, with any of the above scenarios, the value added by your availability zones will depend on the likelihood and expression of failure modes: both the ones you have designed for, and the ones you have not. But getting accurate and complete information on such failure modes may not be easy, so the value availability zones provide against this kind of unplanned failure can be difficult to pin down.
There is, however, another area where availability zones have the potential to provide value: planned maintenance. Have you ever needed to move, recable, upgrade, or rebuild a rack? Ever needed to shut off power to part of the data center to do electrical work? Ever needed to apply disruptive updates to your hypervisors, like kernel or QEMU security updates? How about upgrades of OpenStack or the hypervisor operating system?
Chances are good that these kinds of planned maintenance are a higher source of downtime than unplanned hardware failures that happen out of the blue. Therefore, this type of planned maintenance can also impact how you define your availability zones. In general the grouping of racks into availability zones (described previously) still works well, and is the most common availability zone paradigm we see for our customers.
However, it could affect the way in which you group your racks into availability zones. For example, you may choose to create availability zones based on physical parameters like which floor, room or building the equipment is located in, or other practical considerations that would help in the event you need to vacate or rebuild certain space in your DC(s). Ex:
One of the limitations of the OpenStack implementation of availability zones made apparent in this example is that you are forced to choose one definition. Applications can request a specific availability zone, but are not offered the flexibility of requesting building level isolation, vs floor, room, or rack level isolation. This will be a fixed, inherited property of the availability zones you create. If you need more, then you start to enter the realm of other OpenStack abstractions like Regions and Cells.
Other uses for availability zones?
Another way in which people have found to use availability zones is in multi-hypervisor environments. In the ideal world of the "implementation-agnostic" cloud, we would abstract such underlying details from users of our platform. However, there are some key differences between hypervisors that make this aspiration difficult. Take the example of KVM & VMWare.
When an iSCSI target is provisioned with the LVM iSCSI Cinder driver, it cannot be attached to or consumed by ESXi nodes. The provisioning request must go through VMWare's VMDK Cinder driver, which proxies the creation and attachment requests to vCenter. This incompatibility can cause errors and thus a negative user experience if the user tries to attach a volume provisioned from the wrong backend to their hypervisor.
To solve this problem, some operators use availability zones as a way for users to select hypervisor types (for example, AZ_KVM1, AZ_VMWARE1), and set the following configuration in their nova.conf:
[cinder]
cross_az_attach = False
This presents users with an error if they attempt to attach a volume from one availability zone (e.g., AZ_VMWARE1) to an instance in another availability zone (e.g., AZ_KVM1). The call would have failed regardless, but with a different error originating farther downstream, from one of the nova-compute agents, instead of from nova-api. This way, it's easier for the user to see where they went wrong and correct the problem.
In this case, the availability zone also acts as a tag to remind users which hypervisor their VM resides on.
In my opinion, this is stretching the definition of availability zones as failure domains. VMWare may be considered its own failure domain separate from KVM, but that’s not the primary purpose of creating availability zones this way. The primary purpose is to differentiate hypervisor types. Different hypervisor types have a different set of features and capabilities. If we think about the problem in these terms, there are a number of other solutions that come to mind:

Nova Flavors: Define a “VMWare” set of flavors that map to your VCenter cluster(s). If your tenants that use VMWare are different than your tenants who use KVM, you can make them private flavors, ensuring that tenants only ever see or interact with the hypervisor type they expect.
Cinder: Similarly for Cinder, you can make the VMWare backend private to specific tenants, and/or set quotas differently for that backend, to ensure that tenants will only allocate from the correct storage pools for their hypervisor type.
Image metadata: You can tag your images according to the hypervisor they run on. Set the image property hypervisor_type to vmware for VMDK images, and to qemu for other images. The ImagePropertiesFilter in Nova will honor the hypervisor placement request (see the example below).
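A quick sketch of tagging images this way with the OpenStack client; the image names are placeholders, and this assumes the ImagePropertiesFilter is enabled in the Nova scheduler:
$ openstack image set --property hypervisor_type=vmware my-vmdk-image
$ openstack image set --property hypervisor_type=qemu my-qcow2-image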

So… Are availability zones right for me?
So wrapping up, think of availability zones in terms of:

Unplanned failures: If you have a good history of failure data, well understood failure modes, or some known single point of failure that availability zones can help mitigate, then availability zones may be a good fit for your environment.
Planned maintenance: If you have well understood maintenance processes that have awareness of your availability zone definitions and can take advantage of them, then availability zones may be a good fit for your environment. This is where availability zones can provide some of the greatest value, but it is also the most difficult to achieve, as it requires intelligent availability zone-aware rolling updates and upgrades, and affects how data center personnel perform maintenance activities.
Tenant application design/support: If your tenants are running legacy apps, apps with single points of failure, or do not support use of availability zones in their provisioning process, then availability zones will be of no use for these workloads.
Other alternatives for achieving app availability: Workloads built for geo-redundancy can achieve the same level of HA (or better) in a multi-region cloud. If this were the case for your cloud and your cloud workloads, availability zones would be unnecessary.
OpenStack projects: You should factor into your decision the limitations and differences in availability zone implementations between different OpenStack projects, and perform this analysis for all OpenStack projects in your scope of deployment.

The post The first and final words on OpenStack availability zones appeared first on Mirantis | Pure Play Open Cloud.
Quelle: Mirantis

Evolving The Mirantis Brand

The post Evolving The Mirantis Brand appeared first on Mirantis | The Pure Play OpenStack Company.
A while back, I shared some of the colorful history of Mirantis swag. Ever since our early days as OpenStack pioneers, our brand has been important to us. It’s helped us stand out among a noisy OpenStack ecosystem and get noticed among much larger organizations. Our booths at OpenStack Summits are always engaging points of interest: a bar, picture search, driving simulator, and robotic hockey table. And our employee swag is over-the-top: jean jackets; track suits; and in Vancouver, hockey jerseys that the OpenStack community still talks about today.

Our brand is also a source of pride, a reflection of our collective spirit, and a testament to our commitment to our ideals. We believe in Pure Play Open Cloud, and many of our earliest t-shirt designs reflected this value, using analogies to Ikea and Breaking Bad. We have celebrated the open source communities in which we participate, with our Zelda-inspired App Catalog shirt, Japan’s Megarantis (“defender of pure openstack!”) and Austin’s OpenStack release beer labels. We’ve even embraced our global company culture by lovingly poking fun at Russian stereotypes.

As our company has evolved, so has our brand. This year, we renewed our focus on managed services, and embraced Kubernetes to help us deliver those services continuously. As we make this shift, our branding team is responding. And this time, we felt it was right for our company's evolution to extend to our logo.

Our logo evolution

Any time a company decides to change its logo or name, it’s a decision that can’t be taken lightly or entered into hastily. Our old logo and color palette served us well for quite a while, and many people may not know that it has already undergone some revisions over time:

The Mirantis logo is more than a decade old, and its original designer and meaning have become company folklore. Personally, it reminds me of the variable x; i.e., when working with customers, Mirantis "solves for x" and builds a cloud that meets their unique needs. Others think that the black swoosh looks like two "helping hands". Some believe the arrow represents the upward trajectory of our customers' business results.

Regardless of interpretation, the logo mark presented several design challenges. The arrow is really skinny. The white cutouts in the black swoosh are noisy, particularly when the logo is a smaller size. The swooshes are different widths, increasing the horizontal bias of the mark. The swooshes don’t align with the edges of the “Mirantis” word mark below. Lastly, the typeface of the word mark includes an “M” that has angled legs, contributing to the uneven or irregular shape of the combined logo and word marks.

When we started work on our new logo, all options were on the table. We considered a complete overhaul, and fully developed a few radically different alternatives. Ultimately, we decided that despite its flaws, our logo has a distinctive, recognizable shape that has considerable cachet in our market. So, we went with an evolutionary approach:

The new logo streamlines our previous logo’s appearance, improves its suitability for printed materials, and imbues it with a more contemporary visual appearance. We also updated the typeface used for “Mirantis” to increase its readability, improve its interplay with the logo mark, and to make it more compatible with the fonts included in our brand’s current visual identity. Our color palette is refreshed with a more intense primary red, a new “icy teal” complementary color, and a deep, warm plum for dark color fills and backgrounds. (You’ll notice that these colors have become staples on our website, as well.)

If you need any of our visual assets or color values, you can find them on our logo page.

We at Mirantis like to have fun with our brand while promoting our values: pure play, openness, and community; and we’re definitely not going to stop having fun while delivering those messages. In the coming months, we’ll extend our branding to our booth at OpenStack Summit Boston in May, and for several other open source events. We hope you’ll come say hi and check out our latest swag.

Connect With Mirantis
If you’re not already connected with us, we encourage you to sign up for our newsletter and follow us on your preferred social networks:

The post Evolving The Mirantis Brand appeared first on Mirantis | The Pure Play OpenStack Company.
Quelle: Mirantis

A dash of Salt(Stack): Using Salt for better OpenStack, Kubernetes, and Cloud — Q&A

The post A dash of Salt(Stack): Using Salt for better OpenStack, Kubernetes, and Cloud — Q&A appeared first on Mirantis | The Pure Play OpenStack Company.
On January 16, Ales Komarek presented an introduction to Salt. We covered the following topics:

The model-driven architectures behind how Salt stores topologies and workflows

How Salt provides solution adaptability for any custom workloads

Infrastructure as Code: How Salt provides not only configuration management, but entire life-cycle management

How Continuous Delivery/ Integration/ Management fits into the puzzle

How Salt manages and scales parallel cloud deployments that include OpenStack, Kubernetes and others

What we didn't do, however, is get to all of the questions from the audience, so here's a written version of the Q&A, including those we didn't have time for.
Q: Why Salt?
A: It's Python, it has a huge and growing base of imperative modules and declarative states, and it has a good message bus.
Q: What tools are used to initially provision Salt across an infrastructure? Cobbler, Puppet, MAAS?
A: To create a new deployment, we rely on a single node, where we bootstrap the Salt master and Metal-as-a-Service (formerly based on Foreman, now Ironic). Then we control the MaaS service to deploy the physical bare-metal nodes.
Q: How broad a range of services do you already have recipes for, and how easy is it to write and drop in new ones if you need one that isn&8217;t already available?
A: The ecosystem is pretty vast. You can look at either https://github.com/tcpcloud or the formula ecosystem overview at http://openstack-salt.tcpcloud.eu/develop/extending-ecosystem.html. There are also guidelines for creating new formulas, which is a very straightforward process. A new service can be created in a matter of hours, or even minutes.
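For readers who haven't seen one, a minimal, hypothetical Salt state file looks something like the following (the package name is arbitrary); the formulas referenced above layer pillar-driven configuration and metadata on top of this same basic structure:
# nginx/init.sls -- a deliberately minimal example state
nginx:
  pkg.installed: []
  service.running:
    - enable: True
    - require:
      - pkg: nginx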
Q: Can you convert your existing Puppet/Ansible scripts to Salt, and what would I search to find information about that?
A: Yes, we have reverse engineered automation for some of these services in the past. For example, we were deeply inspired by the Ansible module for Gerrit resource management. You can find some information on creating Salt Formulas at https://docs.saltstack.com/en/latest/topics/development/conventions/formulas.html, and we will be adding tutorial material here on this blog in the near future.
Q: Is there a NodeJS binding available?
A: If you mean the NodeJS formula to set up a NodeJS environment, yes, there is such a formula. If you mean bindings to the system, you can use the Salt API to integrate NodeJS with Salt.
Q: Have you ever faced performance issues when storing a lot of data in pillars?
A: We have not faced performance issues with pillars that are delivered by the reclass ENC. It has been tested up to a few thousand nodes.
Q: What front end GUI is typically used with Salt monitoring (e.g., Kibana, Grafana, etc.)?
A: Salt monitoring uses Sensu or StackLight for the actual functional monitoring checks. It uses Kibana to display events stored in Elasticsearch and Grafana to visualize metrics coming from time-series databases such as Graphite or Influx.
Q: What is the name of the salt PKI manager? (Or what would I search for to learn more about using salt for infrastructure-wide PKI management?)
A: The PKI feature is well documented in the Salt docs, and is available at https://docs.saltstack.com/en/latest/ref/states/all/salt.states.x509.html.
Q: Can I practice installing and deploying SaltStack on my laptop? Can you recommend a link?
A: I'd recommend you have a look at http://openstack-salt.tcpcloud.eu/develop/quickstart-vagrant.html where you can find a nice tutorial on how to set up a simple infrastructure.
Q: Thanks for the presentation! Within Heat, I've only ever seen salt used in terms of software deployments. What we've seen today, however, goes clear through to service, resource, and even infrastructure deployment! In this way, does Salt become a viable alternative to Heat? (I'm trying to understand where the demarcation is between the two now.)
A: Think of Heat as the part of the solution responsible for spinning up the hardware resources such as networks, routers and servers, in a way that is similar to MaaS, Ironic or Foreman. Salt's part begins where Heat's part ends: after the resources are started, Salt takes over and finishes the installation/configuration process.
Q: When you mention Orchestration, how does salt differentiate from Heat, or is Salt making Heat calls?
A: Heat is more for hardware resource orchestration. It has some capability to do software configuration, but it is rather limited. We have created Heat resources that help to classify resources on the fly. We also have Salt Heat modules capable of running a Heat stack.
Q: Will you be showing any parts of SaltStack Enterprise, or only FREE Salt Open Source? Do you use Salt in Multi-Master deployment?
A: We are using the open source version of SaltStack; the enterprise version provides little gain given the pricing model. In some deployments, we use Salt master HA setups.
Q: What HA engine is typically used for the Salt master?
A: We use 2 separate masters with shared storage provided by GlusterFS on which the master's and minions' keys are stored.
Q: Is there a GUI ?
A: The creation of a GUI is currently under discussion.
Q: How do you enforce Role Based Administration in the Salt Master? Can you segregate users to specific job roles and limit which jobs they can execute in Salt?
A: We use the ACLs of the Salt master to limit the user's options. This also applies to the Jenkins-powered pipelines, which we also manage with Salt, both on the job and the user side.
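As a rough sketch of what such an ACL can look like in the Salt master configuration (the username and allowed functions here are hypothetical):
# /etc/salt/master -- restrict a user to a couple of functions
publisher_acl:
  deploy_user:
    - test.ping
    - state.apply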
Q: Can you show the salt files (.sls, pillar, …)?
A: You can look at the github for existing formulas at https://github.com/tcpcloud and good example of pillars can be found at https://github.com/Mirantis/mk-lab-salt-model/.
Q: Is there a link for deploying Salt for Kubernetes? Any best practices guide?
A: The best place to look is the https://github.com/openstack/salt-formula-kubernetes README.
Q: Is SaltStack the same as what's on saltstack.com, or is it a different project?
A: These are the same project. Saltstack.com is the company behind the Salt technology; it provides support and enterprise versions.
Q: So far this looks like what Chef can do. Can you make a comparison or focus on the "value add" from Salt that Chef or Puppet don't give you?
A: Replacing or reusing individual components is very easy, as all formulas are 'aware' of the rest and share a common form and a single dependency tree. This is a problem with community-based formulas in either of the other tools, as they are not very compatible with each other.
Q: In terms of purpose, is there any difference between SaltStack vs Openstack?
A: Apart from the fact that SaltStack can install OpenStack, it can also provide virtualization capabilities. However, Salt has very limited options, while OpenStack supports complex production level scenarios.
Q: Great webinar guys. Ansible seems to have a lot of traction as means of deploying OpenStack. Could you compare/contrast with SaltStack in this context?
A: With Salt, the OpenStack services are just part of a wider ecosystem; the main advantage comes from the consistency across all services/formulas and the supporting metadata that provides documentation and monitoring features.
Q: How is Salt better than Ansible/Puppet/Chef ?
A: The biggest difference is the message bus, which lets you control, and get data from, the infrastructure with great speed and concurrency.
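For example, a couple of illustrative commands run from the master show how the bus fans out to many minions at once (the grain value is just an example):
$ salt '*' test.ping
$ salt -G 'os:Ubuntu' cmd.run 'uptime'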
Q: Can you elaborate on Mirantis Fuel vs SaltStack?
A: Fuel is an open source project that was (and is) designed to deploy OpenStack from a single ISO-based artifact, and to provide various lifecycle management functions once the cluster has been deployed. SaltStack is designed to be more granular, working with individual components or services.
Q: Are there plans to integrate SaltStack in to MOS?
A: The Mirantis Cloud Platform (MCP) will be powered by Salt/Reclass.
Q: Is Fuel obsolete or it will use Salt in the background instead of Puppet?
A: Fuel in its current form will continue to be used for deploying Mirantis OpenStack in the traditional manner (as a single ISO file). We are extending our portfolio of life cycle management tools to include appropriate technologies for deploying and managing open source software in MCP. For example, Fuel CCP will be used to deploy containerized OpenStack on Kubernetes. Similarly, Decapod will be used to deploy Ceph. All of these lifecycle management technologies are, in a sense, Fuel. Whether a particular tool uses Salt or Puppet will depend on what it's doing.
Q: MOS 10 release date?
A: We're still making plans on this.
Thanks for joining us, or if you missed it, please go ahead and view the webinar.
The post A dash of Salt(Stack): Using Salt for better OpenStack, Kubernetes, and Cloud — Q&A appeared first on Mirantis | The Pure Play OpenStack Company.
Quelle: Mirantis