Abstract
Using Karpenter with a fresh new EKS/Kubernetes cluster often goes smoothly: nodes are provisioned and join the cluster without trouble. But sometimes it works well in one region and not in others. Why?
This blog describes the issues I faced and how to troubleshoot them.
Table Of Contents
Pre-requisite
The issues
The following might be one or more of these issues:
- No kube-proxy and aws-node Daemonset Pods are created on the Karpenter nodes.
- kubelet failed to start, so the node is stuck in NotReady.
The following sections walk through each case we need to troubleshoot to find the root cause.
Check IAM instance profile
The condition for an EKS node to join the cluster is the IAM instance profile, which provides the IAM permissions for the node to access AWS resources and call the EKS cluster API.
Furthermore, we also need to add the instance profile's role to the aws-auth configmap to grant it the system:bootstrappers and system:nodes groups in the cluster's role-based access control (RBAC) configuration in the Amazon EKS control plane. Read more: Enabling IAM user and role access to your cluster.

aws-auth configmap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapAccounts: |
    []
  mapRoles: |
    - "groups":
      - "system:nodes"
      - "system:bootstrappers"
      "rolearn": "arn:aws:iam::<accountID>:role/KarpenterNodeInstanceProfile"
      "username": "system:node:{{EC2PrivateDNSName}}"
The managed IAM policies that the instance profile's role needs:
- AmazonEKSWorkerNodePolicy
- AmazonEC2ContainerRegistryReadOnly
- AmazonSSMManagedInstanceCore
- AmazonEKS_CNI_Policy
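As a sketch, these policies can be attached (or double-checked) with a small loop over the AWS CLI. The role name is the hypothetical one from the aws-auth example above; the loop prints the commands as a dry run, so drop the leading echo to actually apply them:

```shell
# Hypothetical role name taken from the aws-auth example above.
role="KarpenterNodeInstanceProfile"

# Dry run: print the attach commands; remove the leading "echo" to apply them.
for policy in AmazonEKSWorkerNodePolicy AmazonEC2ContainerRegistryReadOnly \
              AmazonSSMManagedInstanceCore AmazonEKS_CNI_Policy; do
  echo aws iam attach-role-policy --role-name "$role" \
    --policy-arn "arn:aws:iam::aws:policy/${policy}"
done
```

All four are AWS managed policies, so their ARNs live under the shared arn:aws:iam::aws:policy/ prefix.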
Instance profile role's Trust Policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EKSWorkerAssumeRole",
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
Check kube-proxy and aws-node Daemonset
When a new node joins the cluster, two Daemonset Pods in the kube-system namespace must be created on it:
- kube-proxy, which provides critical networking functionality for Kubernetes applications.
- aws-node, which is the Amazon VPC Container Network Interface (CNI) plugin for Kubernetes.
So we need to verify those pods are in Running status on the worker node. If not, perform the following checks.
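One way to check is to list the kube-system Pods pinned to the new node with a field selector. The node name below is hypothetical, and the command is printed as a dry run; remove the echo to run it against your cluster:

```shell
# Hypothetical node name; replace with your Karpenter node.
node="ip-10-10-11-198.ap-southeast-1.compute.internal"

# Dry run: print the kubectl command; remove "echo" to execute it.
echo kubectl get pods -n kube-system -o wide \
  --field-selector "spec.nodeName=${node}"
```

If the node joined correctly, the output should include a Running kube-proxy Pod and a Running aws-node Pod.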
Check Security group and Elastic network interface
We need to ensure traffic between Pods is allowed: the coredns Pods belong to a Deployment and run on only a subset of nodes, so new nodes must be allowed to reach them. Verify the security groups and elastic network interfaces permit this traffic. Check out: Understand Pods communication.
Debug where the kubelet got stuck
If there are no kube-proxy and aws-node Daemonset Pods on the node, the kubelet failed to start. Look at the logs on the node:

Apr 23 05:26:38 ip-10-10-11-100 containerd: time="2022-04-23T05:26:38.814171952Z" level=error msg="failed to load cni during init, please check CRI plugin status before setting up network for pods" error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
Karpenter creates the node object, and then kubelet populates the necessary labels once it can connect to the API server.
The kube-proxy Daemonset has a nodeAffinity requirement:

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: beta.kubernetes.io/os
            operator: In
            values:
            - linux
          - key: beta.kubernetes.io/arch
            operator: In
            values:
            - amd64
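To see whether a node carries the labels that kube-proxy's affinity requires, kubectl can print them as extra columns with -L (missing labels show up as blanks). The node name is hypothetical and the command is printed as a dry run; remove the echo to run it:

```shell
# Hypothetical node name; replace with your Karpenter node.
node="ip-10-10-11-198.ap-southeast-1.compute.internal"

# Dry run: print the kubectl command; remove "echo" to execute it.
# -L adds the given label values as output columns.
echo kubectl get node "$node" \
  -L beta.kubernetes.io/os -L beta.kubernetes.io/arch
```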
Because the kubelet failed to start, the node does not have those labels:

# kubectl describe node ip-10-10-11-198.ap-southeast-1.compute.internal
Name:               ip-10-10-11-198.ap-southeast-1.compute.internal
Roles:              <none>
Labels:             karpenter.sh/capacity-type=spot
                    karpenter.sh/provisioner-name=karpenter-test
                    lifecycle=spot
                    node.kubernetes.io/instance-type=t3.large
                    role=karpenter
                    topology.kubernetes.io/zone=ap-southeast-1b
                    type=test
Annotations:        node.alpha.kubernetes.io/ttl: 0
CreationTimestamp:  Fri, 22 Apr 2022 15:37:34 +0000
Taints:             node.kubernetes.io/unreachable:NoExecute
                    dedicated=karpenter:NoSchedule
                    karpenter.sh/not-ready:NoSchedule
                    node.kubernetes.io/unreachable:NoSchedule
Unschedulable:      false
Lease:              Failed to get lease: leases.coordination.k8s.io "ip-10-10-11-198.ap-southeast-1.compute.internal" not found
Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason                  Message
  ----             ------    -----------------                 ------------------                ------                  -------
  Ready            Unknown   Fri, 22 Apr 2022 15:37:34 +0000   Fri, 22 Apr 2022 15:38:34 +0000   NodeStatusNeverUpdated  Kubelet never posted node status.
  MemoryPressure   Unknown   Fri, 22 Apr 2022 15:37:34 +0000   Fri, 22 Apr 2022 15:38:34 +0000   NodeStatusNeverUpdated  Kubelet never posted node status.
  DiskPressure     Unknown   Fri, 22 Apr 2022 15:37:34 +0000   Fri, 22 Apr 2022 15:38:34 +0000   NodeStatusNeverUpdated  Kubelet never posted node status.
  PIDPressure      Unknown   Fri, 22 Apr 2022 15:37:34 +0000   Fri, 22 Apr 2022 15:38:34 +0000   NodeStatusNeverUpdated  Kubelet never posted node status.
Now get into the node using SSM or SSH to check the kubelet logs, and run the kubelet bootstrap command directly inside the node:

# /etc/eks/bootstrap.sh dev-d1 --apiserver-endpoint https://111111111111111111111.yl4.ap-southeast-1.eks.amazonaws.com --b64-cluster-ca xxx= --container-runtime containerd --kubelet-extra-args '--node-labels=role=karpenter,type=test,karpenter.sh/provisioner-name=karpenter-test,lifecycle=spot,karpenter.sh/capacity-type=spot --register-with-taints=dedicated=karpenter:NoSchedule'
sed: can't read /etc/eks/containerd/containerd-config.toml: No such file or directory
Exited with error on line 469
Ooh, we're missing the containerd configuration file containerd-config.toml. Check the status of sandbox-image.service and we see it cannot finish starting:

# systemctl status sandbox-image -l
● sandbox-image.service - pull sandbox image defined in containerd config.toml
   Loaded: loaded (/etc/systemd/system/sandbox-image.service; enabled; vendor preset: disabled)
   Active: activating (start) since Sat 2022-04-23 08:52:45 UTC; 12min ago
 Main PID: 2965 (bash)
    Tasks: 2
   Memory: 17.0M
   CGroup: /system.slice/sandbox-image.service
           ├─2965 bash /etc/eks/containerd/pull-sandbox-image.sh
           └─3691 sleep 170

Apr 23 09:05:42 ip-10-10-11-162.vc.p pull-sandbox-image.sh[2965]: github.com/containerd/containerd/vendor/github.com/urfave/cli.(*App).RunAsSubcommand(0xc00032c540, 0xc0000ecf20, 0x0, 0x0)
Apr 23 09:05:42 ip-10-10-11-162.vc.p pull-sandbox-image.sh[2965]: /builddir/build/BUILD/containerd-1.4.13-2.amzn2.0.1/src/github.com/containerd/containerd/vendor/github.com/urfave/cli/app.go:404 +0x8f4
Apr 23 09:05:42 ip-10-10-11-162.vc.p pull-sandbox-image.sh[2965]: main.main()
Apr 23 09:05:42 ip-10-10-11-162.vc.p pull-sandbox-image.sh[2965]: github.com/containerd/containerd/cmd/ctr/main.go:37 +0x125
Run the script with tracing for debugging, and we see it gets stuck at the AWS ECR login step.
# bash -x /etc/eks/containerd/pull-sandbox-image.sh
++ awk '-F[ ="]+' '$1 == "sandbox_image" { print $2 }' /etc/containerd/config.toml
+ sandbox_image=123456789012.dkr.ecr.ap-southeast-1.amazonaws.com/eks/pause:3.1-eksbuild.1
++ echo 123456789012.dkr.ecr.ap-southeast-1.amazonaws.com/eks/pause:3.1-eksbuild.1
++ cut -f4 -d .
+ region=ap-southeast-1
++ aws ecr get-login-password --region ap-southeast-1
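The trace above shows how pull-sandbox-image.sh derives the region from the sandbox image URI before calling aws ecr get-login-password. That string handling can be reproduced on its own:

```shell
# Sandbox image URI taken from the trace above.
sandbox_image="123456789012.dkr.ecr.ap-southeast-1.amazonaws.com/eks/pause:3.1-eksbuild.1"

# The script takes the 4th dot-separated field of the registry hostname as the region.
region=$(echo "$sandbox_image" | cut -f4 -d .)
echo "$region"   # prints: ap-southeast-1
```

So the hang is not in the parsing; it is the aws ecr get-login-password call that never returns, which points at network reachability to the ECR API.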
Now we are close to the root cause.
Check AWS service endpoints
We know that our node is getting stuck at the AWS ECR login. For security, the Amazon ECR registries that hold production repositories are often placed in a private network, so we need to verify that the worker node can reach the API endpoints for Amazon ECR (and also Amazon EC2 and S3 if those service endpoints are not publicly reachable).
Check the VPC endpoints (go to VPC -> Endpoints) and make sure their security groups allow traffic from our worker node.
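As a sketch, assuming a private cluster with no NAT path to these services, the loop below prints the interface endpoint service names typically needed for node bootstrap and image pulls, plus the S3 gateway endpoint (ECR image layers are served from S3). The region is the hypothetical one from the trace above:

```shell
region="ap-southeast-1"   # hypothetical region from the trace above

# Interface endpoints the worker node typically needs:
for svc in ecr.api ecr.dkr ec2 sts; do
  echo "com.amazonaws.${region}.${svc}"
done

# Plus a gateway endpoint for S3, where ECR stores image layers:
echo "com.amazonaws.${region}.s3"
```

If you use SSM Session Manager to reach the node, as in the debugging steps above, the ssm, ssmmessages, and ec2messages interface endpoints are needed as well.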
Conclusion
- There are many possible causes for worker nodes failing to join an EKS/Kubernetes cluster, so this blog just shows you a path and some clues for troubleshooting.