Introduction to Kubernetes Node Affinity
The pod scheduler is one of the core components of Kubernetes. Whenever an application pod is created as per the user’s request, the scheduler determines the placement of these pods onto worker nodes in the cluster. The scheduler is flexible and can be customized for advanced scheduling situations. Before scheduling a pod on a worker node, the scheduler takes the following points into consideration:
- check if a node selector is defined in the pod definition; if it is defined, nodes whose labels do not match are not considered
- check if the worker nodes have enough memory, compute and storage resources
- check if nodes have any taints and if the pod to be scheduled has toleration for these taints
- check affinity and anti-affinity rules
Based on these factors, the scheduler assigns a weighted score to each node, and the node with the highest score is selected to run the pod.
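Two of the checks above, the node selector and taint tolerations, can be expressed directly in a pod specification. The following is a minimal sketch; the label disktype=ssd and the taint key dedicated are hypothetical values used only for illustration:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  # Only nodes carrying this (hypothetical) label are considered
  nodeSelector:
    disktype: ssd
  # Allows scheduling onto nodes tainted with dedicated=backend:NoSchedule
  tolerations:
  - key: dedicated
    operator: Equal
    value: backend
    effect: NoSchedule
  containers:
  - name: app
    image: nginx
```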
Kubernetes Node Affinity is the Successor of nodeSelector
Node affinity rules are used to influence which node a pod is scheduled
to. In the earlier K8s versions, the node affinity mechanism was
implemented through the nodeSelector field in the pod specification.
The node had to include all the labels specified in that field to become
eligible for hosting the pod.
Node affinity is a more sophisticated form of nodeSelector, as it
offers a much wider range of selection criteria. Each pod can specify its
preferences and requirements by defining its own node affinity rules.
Based on these rules, the Kubernetes scheduler will try to place the pod
on one of the nodes matching the defined criteria.
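For comparison, the older nodeSelector form of such a constraint looks like this; the label tier=gold here is a hypothetical example:

```yaml
# nodeSelector: the node must carry every listed label exactly;
# no operators, weights or preferences are possible
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  nodeSelector:
    tier: gold
  containers:
  - name: app
    image: nginx
```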
Prerequisites
- You must have a working Kubernetes cluster with at least two worker nodes to implement these scenarios. Single-node clusters, such as Minikube, will not work for this demo.
- You must have a working knowledge of Kubernetes and YAML.
Add Label to Worker Nodes
We’ll first proceed to label the two worker nodes in our cluster as follows:
kubectl label nodes worker-n1.k8s.local tier=gold
kubectl label nodes worker-n2.k8s.local tier=silver
[root@masternode ~]# kubectl label nodes worker-n1.k8s.local tier=gold
node/worker-n1.k8s.local labeled
[root@masternode ~]# kubectl label nodes worker-n2.k8s.local tier=silver
node/worker-n2.k8s.local labeled
[root@masternode ~]#
Create a deployment with Node Affinity
We’re now going to create a sample deployment and enforce node affinity rules using the node labels defined earlier.
Create a sample nginx deployment as follows:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: tier
                operator: In
                values:
                - gold
      containers:
      - image: nginx
        name: nginx
Create this deployment using kubectl:
kubectl create -f nginx.yml
Kubernetes Node Affinity in Action
Let’s walk through the node affinity specification. The
“requiredDuringSchedulingIgnoredDuringExecution” directive can be
broken down into two parts:
- requiredDuringScheduling means that rules under this field specify the labels the node must have for the pod to be scheduled to the node
- IgnoredDuringExecution means the affinity rules will not affect pods that are already running on the node
Setting this directive ensures that affinity only affects the
scheduling of a new
pod and never causes a pod to be evicted from a node. The
nodeSelectorTerms and the matchExpressions fields define the values
that the node’s label must match for the pod to be scheduled to the
node. In our case, it means that the node must have a label “tier”
whose value should be set to “gold”.
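Besides In, the matchExpressions operator field also supports NotIn, Exists, DoesNotExist, Gt and Lt. A sketch of a rule combining Exists and Gt follows; the cpu-count label key is a hypothetical example:

```yaml
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              # Node must simply have a "tier" label, whatever its value
              - key: tier
                operator: Exists
              # Numeric comparison: hypothetical label cpu-count > 4
              - key: cpu-count
                operator: Gt
                values:
                - "4"
```

All expressions within a single matchExpressions list must match for the node to qualify.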
[root@masternode ~]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-846f64db7c-4hh46 1/1 Running 0 16m 192.168.52.139 worker-n1.k8s.local <none> <none>
[root@masternode ~]# kubectl get node worker-n1.k8s.local -L tier
NAME STATUS ROLES AGE VERSION TIER
worker-n1.k8s.local Ready <none> 4d7h v1.24.2 gold
Even if we scale our deployment to multiple replicas, the resulting pods
will always be scheduled on the same node, i.e. the one with the label
“tier=gold”.
kubectl scale --replicas=3 deployment.apps/nginx
Let’s take this one step further. We’ll now remove the label
“tier=gold” from the worker node and then scale our deployment to four
replicas.
[root@masternode ~]# kubectl label nodes worker-n1.k8s.local tier-
node/worker-n1.k8s.local unlabeled
[root@masternode ~]#
[root@masternode ~]# kubectl scale --replicas=4 deployment.apps/nginx
deployment.apps/nginx scaled
[root@masternode ~]#
[root@masternode ~]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-846f64db7c-4hh46 1/1 Running 0 23m 192.168.52.139 worker-n1.k8s.local <none> <none>
nginx-846f64db7c-88jb7 1/1 Running 0 3m8s 192.168.52.145 worker-n1.k8s.local <none> <none>
nginx-846f64db7c-r8g4b 0/1 Pending 0 27s <none> <none> <none> <none>
nginx-846f64db7c-v75bc 1/1 Running 0 3m19s 192.168.52.144 worker-n1.k8s.local <none> <none>
[root@masternode ~]#
As can be seen, the new pod goes into the Pending state. The old pods are
still running, which makes sense as the deployment contains the
“requiredDuringSchedulingIgnoredDuringExecution” field, which ensures
that running pods are not evicted. If we look at the events, we can see
that the new pod could not be scheduled because no node with a
matching label was available.
[root@masternode ~]# kubectl describe pod/nginx-846f64db7c-r8g4b |tail -4
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 3m26s default-scheduler 0/3 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
[root@masternode ~]#
Kubernetes Node Anti-Affinity in Action
Similar to node affinity, node anti-affinity rules can be defined to ensure that a pod is not assigned to a particular group of nodes. These rules define which nodes should not be considered when scheduling a pod. Let’s consider the same nginx deployment configuration which we used for node affinity. We only need to update the operator field in the spec section.
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: tier
                operator: NotIn
                values:
                - gold
The “NotIn” value in the operator field defines the
anti-affinity behavior here. It ensures that the pod is not
scheduled on any node that has the label “tier=gold”, as evident from
the output below.
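An alternative way to express anti-affinity is the DoesNotExist operator, which excludes any node carrying the label key at all, regardless of its value. A sketch:

```yaml
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              # Exclude every node that has a "tier" label, whatever its value
              - key: tier
                operator: DoesNotExist
```

Note that in the cluster used here, where both worker nodes carry a tier label, this rule would leave the pod unschedulable, so NotIn with specific values is the safer choice for this scenario.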
kubectl create -f nginx.yml
[root@masternode ~]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-7547cdb594-w9k7n 1/1 Running 0 17s 192.168.0.3 worker-n2.k8s.local <none> <none>
[root@masternode ~]# kubectl get nodes -L tier
NAME STATUS ROLES AGE VERSION TIER
worker-n1.k8s.local Ready <none> 4d4h v1.24.2 gold
worker-n2.k8s.local Ready <none> 4d4h v1.24.2 silver
masternode.k8s.local Ready control-plane 4d4h v1.24.2
[root@masternode ~]#
Kubernetes Node Affinity Weight in Action
If you have a cluster with multiple worker nodes, then it is quite
possible that more than one node matches the defined affinity rules. In
this case, to schedule the pod on a node of our choice, we can assign a
weighted score to each affinity rule and prioritize among the selected
nodes. This is done through the
preferredDuringSchedulingIgnoredDuringExecution field. Let’s explain
this using an example.
Create an nginx deployment as follows:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: tier
                operator: In
                values:
                - gold
                - silver
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 10
            preference:
              matchExpressions:
              - key: tier
                operator: In
                values:
                - gold
          - weight: 50
            preference:
              matchExpressions:
              - key: tier
                operator: In
                values:
                - silver
      containers:
      - image: nginx
        name: nginx
The “preferredDuringSchedulingIgnoredDuringExecution” directive can be
broken down into two parts:
- preferredDuringScheduling means that rules under this field are preferred for the pod to be scheduled to the node. Note that we’re specifying preferences, not hard requirements.
- IgnoredDuringExecution means the affinity rules will not affect pods that are already running on the node
The requiredDuringSchedulingIgnoredDuringExecution section dictates
that the pods under this deployment must be scheduled on nodes that
have either the label tier=gold or tier=silver. Under the
preferredDuringSchedulingIgnoredDuringExecution section, we define
our scheduling preferences and assign a weighted score, which can be
any integer between 1 and 100, to each preference.
Because of the hard requirement under the
requiredDuringSchedulingIgnoredDuringExecution section, the scheduler
will only consider nodes with the label tier=gold or tier=silver for
running the pod. Based on our defined preferences under
preferredDuringSchedulingIgnoredDuringExecution, the scheduler will
iterate through the rules and assign a weighted score to each matching
node: 10 when tier=gold and 50 when tier=silver. The scheduler adds
this score to its other priority functions and selects the node with the
highest total score for running the pod. In this case, the node with the
label tier=silver, i.e. worker-n2, will be selected.
kubectl create -f nginx.yml
[root@masternode ~]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-5898d7588b-nnlgb 1/1 Running 0 4s 192.168.0.11 worker-n2.k8s.local <none> <none>
[root@masternode ~]#
Conclusion
The pod scheduler in Kubernetes offers a lot of flexibility in scheduling application pods as per user requirements. To determine which nodes are acceptable for scheduling a pod, the scheduler evaluates each node against multiple factors. The end user can dictate or prioritize the nodes when running pods by defining node affinity and anti-affinity rules.


