OpenShift Cluster Node Tuning Operator – Nodes on steroids

What do you prefer: manual or automatic transmissions?
I like the control over a car that a manual transmission provides: using the engine to slow down without the brakes and being more efficient when overtaking. On the other hand, it's nice not to involve the left leg all of the time and to keep both hands on the steering wheel. Using an automatic transmission is generally easier, and my family prefers it, so I have no choice.
Wouldn’t it be great to be able to do things more efficiently and precisely but not do it in a manual way? It would be great if an automatic transmission always behaved as I wanted and needed at that exact moment.
Returning to an OpenShift scenario, I'll ask again:
Wouldn’t it be great to tweak my RHEL CoreOS node only when I need to, and to not have to do it manually?
You can do this using the OpenShift Cluster Node Tuning Operator. This operator gives the user an interface to add custom tuning, apply it to nodes under specified conditions, and configure the kernel according to the user's needs. More information can be found on GitHub.
The Node Tuning Operator manages a daemonset that runs a tuned pod on every node in the cluster. Check it with the command:
$ oc get pods -n openshift-cluster-node-tuning-operator -o wide

NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cluster-node-tuning-operator-847984d77-f92tv 1/1 Running 0 6h51m 10.128.0.17 skordas0813-6p5bl-master-1 <none> <none>
tuned-2gz29 1/1 Running 0 6h51m 10.0.0.4 skordas0813-6p5bl-master-1 <none> <none>
tuned-5hkmr 1/1 Running 0 6h51m 10.0.0.7 skordas0813-6p5bl-master-2 <none> <none>
tuned-5jv59 1/1 Running 0 6h50m 10.0.32.4 skordas0813-6p5bl-worker-centralus1-tkbxs <none> <none>
tuned-gvlnt 1/1 Running 0 6h50m 10.0.32.5 skordas0813-6p5bl-worker-centralus3-nrh4t <none> <none>
tuned-nvfb5 1/1 Running 0 6h51m 10.0.0.6 skordas0813-6p5bl-master-0 <none> <none>
tuned-xhpfx 1/1 Running 0 6h49m 10.0.32.6 skordas0813-6p5bl-worker-centralus2-xm865 <none> <none>

Also, you can check the tuned custom resources:
$ oc get tuned -n openshift-cluster-node-tuning-operator
NAME AGE
default 5h31m

Let's take a closer look at the default tuning:
$ oc get tuned -o yaml -n openshift-cluster-node-tuning-operator

apiVersion: v1
items:
- apiVersion: tuned.openshift.io/v1
  kind: Tuned
  metadata:
    creationTimestamp: "2019-08-07T14:08:10Z"
    generation: 1
    name: default
    namespace: openshift-cluster-node-tuning-operator
    resourceVersion: "6878"
    selfLink: /apis/tuned.openshift.io/v1/namespaces/openshift-cluster-node-tuning-operator/tuneds/default
    uid: c9f0361b-b91c-11e9-931e-000d3a9420dc
  spec:
    profile:
    - data: |
        [main]
        summary=Optimize systems running OpenShift (parent profile)
        include=${f:virt_check:virtual-guest:throughput-performance}

        [selinux]
        avc_cache_threshold=8192

        [net]
        nf_conntrack_hashsize=131072

        [sysctl]
        net.ipv4.ip_forward=1
        kernel.pid_max=>131072
        net.netfilter.nf_conntrack_max=1048576
        net.ipv4.neigh.default.gc_thresh1=8192
        net.ipv4.neigh.default.gc_thresh2=32768
        net.ipv4.neigh.default.gc_thresh3=65536
        net.ipv6.neigh.default.gc_thresh1=8192
        net.ipv6.neigh.default.gc_thresh2=32768
        net.ipv6.neigh.default.gc_thresh3=65536

        [sysfs]
        /sys/module/nvme_core/parameters/io_timeout=4294967295
        /sys/module/nvme_core/parameters/max_retries=10
      name: openshift
    - data: |
        [main]
        summary=Optimize systems running OpenShift control plane
        include=openshift

        [sysctl]
        # ktune sysctl settings, maximizing i/o throughput
        #
        # Minimal preemption granularity for CPU-bound tasks:
        # (default: 1 msec# (1 + ilog(ncpus)), units: nanoseconds)
        kernel.sched_min_granularity_ns=10000000
        # The total time the scheduler will consider a migrated process
        # "cache hot" and thus less likely to be re-migrated
        # (system default is 500000, i.e. 0.5 ms)
        kernel.sched_migration_cost_ns=5000000
        # SCHED_OTHER wake-up granularity.
        #
        # Preemption granularity when tasks wake up. Lower the value to
        # improve wake-up latency and throughput for latency critical tasks.
        kernel.sched_wakeup_granularity_ns=4000000
      name: openshift-control-plane
    - data: |
        [main]
        summary=Optimize systems running OpenShift nodes
        include=openshift

        [sysctl]
        net.ipv4.tcp_fastopen=3
        fs.inotify.max_user_watches=65536
      name: openshift-node
    - data: |
        [main]
        summary=Optimize systems running ES on OpenShift control-plane
        include=openshift-control-plane

        [sysctl]
        vm.max_map_count=262144
      name: openshift-control-plane-es
    - data: |
        [main]
        summary=Optimize systems running ES on OpenShift nodes
        include=openshift-node

        [sysctl]
        vm.max_map_count=262144
      name: openshift-node-es
    recommend:
    - match:
      - label: tuned.openshift.io/elasticsearch
        match:
        - label: node-role.kubernetes.io/master
        - label: node-role.kubernetes.io/infra
        type: pod
      priority: 10
      profile: openshift-control-plane-es
    - match:
      - label: tuned.openshift.io/elasticsearch
        type: pod
      priority: 20
      profile: openshift-node-es
    - match:
      - label: node-role.kubernetes.io/master
      - label: node-role.kubernetes.io/infra
      priority: 30
      profile: openshift-control-plane
    - priority: 40
      profile: openshift-node
  status: {}
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

The spec.profile section is a list of profile definitions, each with a name and the settings that the operator will apply on a node. It is possible to define a child profile that you only use inside other profiles via the include key; in the example above, the openshift profile is used this way. We can also add a summary to describe each profile.
The spec.recommend section is a list of profile selection rules that determine which conditions must be met for the operator to apply a given profile to a node. This part may not be so obvious, so let's look deeper.
Each rule needs three pieces of information:
match – What conditions need to be met to apply the recommended profile? If the match part is omitted, then the operator assumes that the match is always true. More details below.
priority – lower numbers mean higher priority. If more than one rule matches, the Node Tuning Operator applies the profile from the matching rule with the highest priority (the lowest number).
profile – name of the profile from spec.profile that should be applied.
If you want to apply more than one profile at the same time, you need to create a new profile that includes the other profiles, as in the sketch below.
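For example, a combined profile entry in spec.profile could look roughly like this (a minimal sketch; the profile names other than openshift-node are hypothetical, and it assumes tuned's include directive accepting a comma-separated list of profiles):

- data: |
    [main]
    summary=Hypothetical profile combining two other profiles
    include=openshift-node,my-other-profile
  name: my-combined-profile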
What criteria need to be met to apply a specific profile? Everything is driven by labels on nodes and pods, and all conditions live in the match section.
Each match can have four definitions:
label – node or pod label.
value – node or pod label value; if it is omitted, the operator matches on the existence of the label alone.
type – either node or pod; it defines whose label the operator should check. If it is omitted, the operator checks node labels.
match – an array of additional nested matches; the operator evaluates a nested match only when its parent match returns true.
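Putting all four keys together, a single match entry could look like the following sketch; the label name, value, and profile name here are purely illustrative:

- match:
  - label: example.com/my-label              # pod label to look for (hypothetical)
    value: enabled                           # required value of that label
    type: pod                                # check pod labels instead of node labels
    match:                                   # nested match, checked only if the pod label matched
    - label: node-role.kubernetes.io/worker
  priority: 35
  profile: my-custom-profile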
Reading the recommend section is much easier now. Let’s move on to the default recommendation. The operator will check each node independently to determine which profile should be used on which node.
- match:
  - label: tuned.openshift.io/elasticsearch
    match:
    - label: node-role.kubernetes.io/master
    - label: node-role.kubernetes.io/infra
    type: pod
  priority: 10
  profile: openshift-control-plane-es

First, the operator checks whether the node is running a pod with the tuned.openshift.io/elasticsearch label. If that match is true, it evaluates the nested match: if the node (node is implied because type is omitted) has the node-role.kubernetes.io/master or node-role.kubernetes.io/infra label, the operator applies the openshift-control-plane-es profile, because it is a control plane or infra node running an Elasticsearch pod.
If this control plane/infra match is false, the operator moves on and checks the next match, with lower priority:
- match:
  - label: tuned.openshift.io/elasticsearch
    type: pod
  priority: 20
  profile: openshift-node-es

The openshift-node-es profile will be applied only when the previous control plane/infra match returns false and the node is running a pod with the tuned.openshift.io/elasticsearch label.
As before, if there is no match, we continue to the next rule in priority order:
- match:
  - label: node-role.kubernetes.io/master
  - label: node-role.kubernetes.io/infra
  priority: 30
  profile: openshift-control-plane

The openshift-control-plane profile will be applied only when the previous matches return false and the node is labeled node-role.kubernetes.io/master or node-role.kubernetes.io/infra.
Finally, if there were no matches by this point, the operator will apply the openshift-node profile:
- priority: 40
  profile: openshift-node

Because there is no match array, it is always true.
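If you want to confirm which profile was actually applied on a given node, one simple option is to check the logs of that node's tuned pod (the pod name below comes from the earlier listing; the exact log wording may differ between versions):

$ oc logs tuned-5jv59 -n openshift-cluster-node-tuning-operator | grep profile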
Now we can create our own profile:

Create a file with the custom resource: cool_app_ip_port_range.yaml

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: ports
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=A custom profile to extend local port range

      [sysctl]
      net.ipv4.ip_local_port_range="1024 65535"
    name: port-range
  recommend:
  - match:
    - label: cool-app
      value: extended-range
      type: pod
    priority: 25
    profile: port-range

Create the new Tuned resource and verify that it is there:

$ oc create -f cool_app_ip_port_range.yaml
tuned.tuned.openshift.io/ports created
$ oc get tuned -n openshift-cluster-node-tuning-operator
NAME AGE
default 6h32m
ports 31s

Let’s check the value of net.ipv4.ip_local_port_range on each node:

$ for i in $(oc get nodes --no-headers -o=custom-columns=NAME:.metadata.name); do echo $i; oc debug node/$i -- chroot /host sysctl net.ipv4.ip_local_port_range; done

In my case each node has the same range:
net.ipv4.ip_local_port_range = 32768 60999

Create our own app and label it correctly

$ oc new-project my-cool-project
$ oc new-app django-psql-example
$ oc get pods -o wide -n my-cool-project | grep Running
django-psql-example-1-pgd67 1/1 Running 0 3m15s 10.128.2.10 skordas0813-6p5bl-worker-centralus3-nrh4t <none> <none>
postgresql-1-cw86k 1/1 Running 0 5m12s 10.131.0.14 skordas0813-6p5bl-worker-centralus1-tkbxs <none> <none>
$ oc label pod postgresql-1-cw86k -n my-cool-project cool-app=
$ oc label pod django-psql-example-1-pgd67 -n my-cool-project cool-app=extended-range
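Optionally, you can confirm that the labels landed on the pods before re-checking the nodes:

$ oc get pods -n my-cool-project --show-labels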

Check net.ipv4.ip_local_port_range once again on each node:

$ for i in $(oc get nodes --no-headers -o=custom-columns=NAME:.metadata.name); do echo $i; oc debug node/$i -- chroot /host sysctl net.ipv4.ip_local_port_range; done

On node skordas0813-6p5bl-worker-centralus3-nrh4t the value of net.ipv4.ip_local_port_range has been changed
net.ipv4.ip_local_port_range = 1024 65535

because a pod labeled cool-app=extended-range is running on this node!
If you change the matching label, or simply delete the pod, the project, or the ports Tuned resource, then the range will be set back to the default kernel values, as in the example below.
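For instance, deleting the custom resource we created earlier reverts the setting:

$ oc delete tuned ports -n openshift-cluster-node-tuning-operator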
Everything is managed by the OpenShift Cluster Node Tuning Operator and the profiles you use, so you don’t need to tweak any values on the nodes’ operating system. This results in an automatic transmission-like experience for operators of OpenShift.