Ways of distributing pods across nodes
Tue 02 February 2021 by admin

How can we distribute pods more evenly across nodes? After a quick bit of research I found that a Deployment like this should do the job:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 3
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:latest
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: "app"
                  operator: In
                  values:
                  - nginx
              topologyKey: "kubernetes.io/hostname"
In this example we instruct Kubernetes not to schedule a pod onto a node that already runs another pod labeled app=nginx. This is only a soft constraint because of preferredDuringSchedulingIgnoredDuringExecution: when there is no way to fulfill the requirement, the scheduler is allowed to break the rule.
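For comparison, here is a minimal sketch of the hard variant (not used in the rest of this post); it replaces the affinity section above and leaves pods Pending rather than breaking the rule:

      affinity:
        podAntiAffinity:
          # hard rule: never co-locate two app=nginx pods on one node
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: "app"
                operator: In
                values:
                - nginx
            topologyKey: "kubernetes.io/hostname"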
So let's check it out. I will create a 3-node cluster using kind with this config:
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
$ kind create cluster --config kind.yaml
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
kind-control-plane Ready control-plane,master 85m v1.20.2
kind-worker Ready <none> 85m v1.20.2
kind-worker2 Ready <none> 85m v1.20.2
# remove taints from kind-control-plane
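# (assumption: on a v1.20 kind cluster the control-plane carries the
#  node-role.kubernetes.io/master:NoSchedule taint; adjust the key if yours differs)
$ kubectl taint nodes kind-control-plane node-role.kubernetes.io/master-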
$ for i in $(seq 0 4); do kubectl apply -f test.yaml && kubectl wait --for=condition=available --timeout=600s deployment/nginx && kubectl get pod -o=custom-columns=NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName -l app=nginx && kubectl delete -f test.yaml && kubectl wait --for=delete --timeout=600s pod -l app=nginx; done
deployment.apps/nginx created
deployment.apps/nginx condition met
NAME STATUS NODE
nginx-677d446dfc-6grxk Running kind-worker2
nginx-677d446dfc-hdmr2 Running kind-worker
nginx-677d446dfc-q6g4d Running kind-control-plane
deployment.apps "nginx" deleted
pod/nginx-677d446dfc-6grxk condition met
pod/nginx-677d446dfc-hdmr2 condition met
pod/nginx-677d446dfc-q6g4d condition met
deployment.apps/nginx created
deployment.apps/nginx condition met
NAME STATUS NODE
nginx-677d446dfc-5v9jc Running kind-worker2
nginx-677d446dfc-gx8r4 Running kind-control-plane
nginx-677d446dfc-lf579 Running kind-worker
deployment.apps "nginx" deleted
pod/nginx-677d446dfc-5v9jc condition met
pod/nginx-677d446dfc-gx8r4 condition met
pod/nginx-677d446dfc-lf579 condition met
deployment.apps/nginx created
deployment.apps/nginx condition met
NAME STATUS NODE
nginx-677d446dfc-8rj4t Running kind-worker
nginx-677d446dfc-9rxc7 Running kind-worker2
nginx-677d446dfc-m5xsh Running kind-control-plane
deployment.apps "nginx" deleted
pod/nginx-677d446dfc-8rj4t condition met
pod/nginx-677d446dfc-9rxc7 condition met
pod/nginx-677d446dfc-m5xsh condition met
deployment.apps/nginx created
deployment.apps/nginx condition met
NAME STATUS NODE
nginx-677d446dfc-8xtwx Running kind-worker
nginx-677d446dfc-jsn86 Running kind-control-plane
nginx-677d446dfc-mqhkx Running kind-worker2
deployment.apps "nginx" deleted
pod/nginx-677d446dfc-8xtwx condition met
pod/nginx-677d446dfc-jsn86 condition met
pod/nginx-677d446dfc-mqhkx condition met
deployment.apps/nginx created
deployment.apps/nginx condition met
NAME STATUS NODE
nginx-677d446dfc-2hzcw Running kind-control-plane
nginx-677d446dfc-mdsgg Running kind-worker2
nginx-677d446dfc-rf6kx Running kind-worker
deployment.apps "nginx" deleted
pod/nginx-677d446dfc-2hzcw condition met
pod/nginx-677d446dfc-mdsgg condition met
pod/nginx-677d446dfc-rf6kx condition met
It works pretty well compared to the same Deployment without the affinity section:
$ for i in $(seq 0 4); do kubectl apply -f test.yaml && kubectl wait --for=condition=available --timeout=600s deployment/nginx && kubectl get pod -o=custom-columns=NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName -l app=nginx && kubectl delete -f test.yaml && kubectl wait --for=delete --timeout=600s pod -l app=nginx; done
deployment.apps/nginx created
deployment.apps/nginx condition met
NAME STATUS NODE
nginx-55649fd747-gxdc7 Running kind-worker
nginx-55649fd747-h7b8k Running kind-worker2
nginx-55649fd747-z4jrk Running kind-worker
deployment.apps "nginx" deleted
pod/nginx-55649fd747-gxdc7 condition met
pod/nginx-55649fd747-h7b8k condition met
pod/nginx-55649fd747-z4jrk condition met
deployment.apps/nginx created
deployment.apps/nginx condition met
NAME STATUS NODE
nginx-55649fd747-cf4dx Running kind-worker2
nginx-55649fd747-jlc7t Running kind-worker
nginx-55649fd747-rz7xh Running kind-worker
deployment.apps "nginx" deleted
pod/nginx-55649fd747-cf4dx condition met
pod/nginx-55649fd747-jlc7t condition met
pod/nginx-55649fd747-rz7xh condition met
deployment.apps/nginx created
deployment.apps/nginx condition met
NAME STATUS NODE
nginx-55649fd747-4znvz Running kind-worker
nginx-55649fd747-67vm6 Running kind-worker2
nginx-55649fd747-wn97t Running kind-worker
deployment.apps "nginx" deleted
pod/nginx-55649fd747-4znvz condition met
pod/nginx-55649fd747-67vm6 condition met
pod/nginx-55649fd747-wn97t condition met
deployment.apps/nginx created
deployment.apps/nginx condition met
NAME STATUS NODE
nginx-55649fd747-vxpq6 Running kind-worker2
nginx-55649fd747-xq7cr Running kind-worker
nginx-55649fd747-xs7qr Running kind-worker
deployment.apps "nginx" deleted
pod/nginx-55649fd747-vxpq6 condition met
pod/nginx-55649fd747-xq7cr condition met
pod/nginx-55649fd747-xs7qr condition met
deployment.apps/nginx created
deployment.apps/nginx condition met
NAME STATUS NODE
nginx-55649fd747-6hnn8 Running kind-worker
nginx-55649fd747-9kwgk Running kind-worker
nginx-55649fd747-djjqr Running kind-worker2
deployment.apps "nginx" deleted
pod/nginx-55649fd747-6hnn8 condition met
pod/nginx-55649fd747-9kwgk condition met
pod/nginx-55649fd747-djjqr condition met
So what happens under the hood? Kubernetes uses kube-scheduler to assign pods to nodes, and its behaviour can be customized with plugins and policies. In this post we take a look at plugins. At each extension point of the pod scheduling cycle kube-scheduler runs a set of default plugins. We can disable or enable plugins at the different extension points, and group such customizations into profiles, which a pod selects through schedulerName in its spec. First, let's take the example from Multiple profiles and apply it to the current kind cluster:
# copy current kube-scheduler pod manifest
$ docker cp kind-control-plane:/etc/kubernetes/manifests/kube-scheduler.yaml .
$ cat kube-scheduler-config.yaml
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf
profiles:
- schedulerName: default-scheduler
- schedulerName: no-scoring-scheduler
  plugins:
    preScore:
      disabled:
      - name: '*'
    score:
      disabled:
      - name: '*'
# modify kube-scheduler pod manifest
@@ -15,6 +15,7 @@
     - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
     - --bind-address=127.0.0.1
     - --kubeconfig=/etc/kubernetes/scheduler.conf
+    - --config=/usr/local/etc/kube-scheduler-config.yaml
     - --leader-elect=true
     - --port=0
     image: k8s.gcr.io/kube-scheduler:v1.20.2
@@ -47,6 +48,9 @@
     - mountPath: /etc/kubernetes/scheduler.conf
       name: kubeconfig
       readOnly: true
+    - mountPath: /usr/local/etc/kube-scheduler-config.yaml
+      name: kube-scheduler-config
+      readOnly: true
   hostNetwork: true
   priorityClassName: system-node-critical
   volumes:
@@ -54,4 +58,8 @@
       path: /etc/kubernetes/scheduler.conf
       type: FileOrCreate
     name: kubeconfig
+  - hostPath:
+      path: /etc/kubernetes/kube-scheduler-config.yaml
+      type: FileOrCreate
+    name: kube-scheduler-config
 status: {}
# upload config and manifest
$ docker cp kube-scheduler-config.yaml kind-control-plane:/etc/kubernetes/kube-scheduler-config.yaml
$ docker cp kube-scheduler.yaml kind-control-plane:/etc/kubernetes/manifests/kube-scheduler.yaml
# verify that kubelet restarted the kube-scheduler pod
$ docker exec -ti kind-control-plane bash -c "ps ax | grep -i kube-scheduler"
7556 ? Ssl 0:44 kube-scheduler --authentication-kubeconfig=/etc/kubernetes/scheduler.conf --authorization-kubeconfig=/etc/kubernetes/scheduler.conf --bind-address=127.0.0.1 --kubeconfig=/etc/kubernetes/scheduler.conf --config=/usr/local/etc/kube-scheduler-config.yaml --leader-elect=true --port=0
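If you want to double-check that the new configuration was actually loaded, you can also peek at the scheduler logs; the pod name below assumes the default static-pod naming on a kind control-plane node:

$ kubectl logs -n kube-system kube-scheduler-kind-control-plane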
Now that we have a new scheduler profile called no-scoring-scheduler, we can reference it via schedulerName in our test deployment:
# diff
       labels:
         app: nginx
     spec:
+      schedulerName: no-scoring-scheduler
       containers:
       - name: nginx
         image: nginx:latest
With this profile, the affinity rules we defined are no longer taken into account when scoring nodes:
$ kubectl apply -f test.yaml && kubectl wait --for=condition=available --timeout=600s deployment/nginx && kubectl get pod -o=custom-columns=NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName -l app=nginx && kubectl delete -f test.yaml && kubectl wait --for=delete --timeout=600s pod -l app=nginx
deployment.apps/nginx created
deployment.apps/nginx condition met
NAME STATUS NODE
nginx-dc7fc7d7-6xppv Running kind-worker
nginx-dc7fc7d7-cflgc Running kind-worker
nginx-dc7fc7d7-d89hh Running kind-control-plane
deployment.apps "nginx" deleted
pod/nginx-dc7fc7d7-6xppv condition met
pod/nginx-dc7fc7d7-cflgc condition met
pod/nginx-dc7fc7d7-d89hh condition met
because we disabled all scoring plugins in that profile. So which plugin deals with affinity? A quick look at the default plugins shows:
InterPodAffinity: Implements inter-Pod affinity and anti-affinity. Extension points: PreFilter, Filter, PreScore, Score
At the Score extension point each plugin returns its computed value for a node. Digging into the source code of this particular plugin: when preferred pod anti-affinity is set and the evaluated node already runs pods matching the anti-affinity selector, the plugin scores that node as
weight(100) * -1 = -100
A score of -100 is low, so the scheduler will rather not use this node for the pod.
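As an aside, the same profile mechanism can also work in the other direction: instead of disabling scoring you could give InterPodAffinity more influence relative to the other score plugins. A rough sketch, with a made-up profile name and weight, assuming the same v1beta1 config format as above:

apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf
profiles:
- schedulerName: spread-heavy-scheduler
  plugins:
    score:
      disabled:
      - name: InterPodAffinity
      enabled:
      # re-enable the plugin with a larger score multiplier than the default
      - name: InterPodAffinity
        weight: 5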
Next time I will try to dig into kube-scheduler
policies.