Troubleshoot Karpenter Node

Abstract

  • Using Karpenter with a fresh new EKS/Kubernetes cluster often goes smoothly: nodes are provisioned and join the cluster. But sometimes it works well in one region and not in another. Why?

  • This blog describes the issues I faced and how to troubleshoot them.

Table Of Contents

  • 🚀 Pre-requisite
  • 🚀 The issues
  • 🚀 Check IAM instance profile
  • 🚀 Check kube-proxy and aws-node Daemonset
  • 🚀 Check Security group and Elastic network interface
  • 🚀 Debug where the kubelet gets stuck
  • 🚀 Check AWS service endpoints

🚀 Check IAM instance profile

  • A prerequisite for an EKS node to join the cluster is the IAM instance profile, which provides the IAM permissions the node needs to access AWS resources and call the EKS cluster API.

  • Furthermore, we also need to add the instance profile's role to the aws-auth configmap so the node is granted the system:bootstrappers and system:nodes groups in the cluster's role-based access control (RBAC) configuration in the Amazon EKS control plane. Read more: Enabling IAM user and role access to your cluster

    aws-auth configmap

      apiVersion: v1
      data:
        mapAccounts: |
          []
        mapRoles: |
          - "groups":
            - "system:nodes"
            - "system:bootstrappers"
            "rolearn": "arn:aws:iam::<accountID>:role/KarpenterNodeInstanceProfile"
            "username": "system:node:{{EC2PrivateDNSName}}"
    
  • The managed policies that the instance profile's role needs:

      AmazonEKSWorkerNodePolicy
      AmazonEC2ContainerRegistryReadOnly
      AmazonSSMManagedInstanceCore
      AmazonEKS_CNI_Policy
    
  • Instance profile role's Trust Policy:

      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Sid": "EKSWorkerAssumeRole",
                  "Effect": "Allow",
                  "Principal": {
                      "Service": "ec2.amazonaws.com"
                  },
                  "Action": "sts:AssumeRole"
              }
          ]
      }
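
  • A quick way to verify all of the above with the AWS CLI (a sketch; the role and instance profile names reuse KarpenterNodeInstanceProfile from the aws-auth example above):

      # List the managed policies attached to the node role
      aws iam list-attached-role-policies --role-name KarpenterNodeInstanceProfile
      # Confirm the trust policy allows ec2.amazonaws.com to assume the role
      aws iam get-role --role-name KarpenterNodeInstanceProfile --query 'Role.AssumeRolePolicyDocument'
      # Check that the instance profile actually contains the role
      aws iam get-instance-profile --instance-profile-name KarpenterNodeInstanceProfile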
    

🚀 Check kube-proxy and aws-node Daemonset

  • When a new node joins the cluster, two DaemonSet Pods must be created on it:

    • kube-proxy, which provides critical networking functionality for Kubernetes applications

    • aws-node, which runs the Amazon VPC Container Network Interface (CNI) plugin for Kubernetes

So we need to verify those Pods are in Running status on the worker node. If not, perform the following checks.
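
  • A quick way to check (the node name here is a placeholder; use the name of the new node):

      # Show the kube-system Pods scheduled on a specific node
      kubectl get pods -n kube-system -o wide \
          --field-selector spec.nodeName=ip-10-10-11-100.ap-southeast-1.compute.internal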

🚀 Check Security group and Elastic network interface

  • We need to ensure traffic is allowed between Pods: the coredns Pods belong to a Deployment and run on only some of the nodes, so new nodes must be allowed to reach them.

  • Check out Understand Pods communication
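
  • A sketch of the check with the AWS CLI (the instance and security group IDs are placeholders):

      # Find the security groups attached to the worker node
      aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
          --query 'Reservations[].Instances[].SecurityGroups'
      # Inspect the group's rules to confirm traffic between nodes (and so between Pods) is allowed
      aws ec2 describe-security-groups --group-ids sg-0123456789abcdef0 \
          --query 'SecurityGroups[].[IpPermissions,IpPermissionsEgress]'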

🚀 Debug where the kubelet gets stuck

  • If the kube-proxy and aws-node DaemonSet Pods never appear on the node, the kubelet failed to start. Look at the logs on the node:

      Apr 23 05:26:38 ip-10-10-11-100 containerd: time="2022-04-23T05:26:38.814171952Z" level=error msg="failed to load cni during init, please check CRI plugin status before setting up network for pods" error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
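
  • The error above comes from the node's journal; a minimal way to pull the relevant logs (kubelet and containerd are standard systemd units on the EKS-optimized Amazon Linux 2 AMI):

      # Tail the kubelet and containerd unit logs
      journalctl -u kubelet --no-pager -n 100
      journalctl -u containerd --no-pager -n 100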
    
  • Karpenter creates the node object, and then the kubelet populates the necessary labels once it can connect to the API server. The kube-proxy DaemonSet selects nodes by some of those labels, so it cannot be scheduled until the kubelet registers:

    • kube-proxy nodeAffinity

        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: beta.kubernetes.io/os
                    operator: In
                    values:
                    - linux
                  - key: beta.kubernetes.io/arch
                    operator: In
                    values:
                    - amd64
      
  • Because the kubelet failed to start, the node does not contain those labels:

      # kubectl describe node ip-10-10-11-198.ap-southeast-1.compute.internal
      Name:               ip-10-10-11-198.ap-southeast-1.compute.internal
      Roles:              <none>
      Labels:             karpenter.sh/capacity-type=spot
                          karpenter.sh/provisioner-name=karpenter-test
                          lifecycle=spot
                          node.kubernetes.io/instance-type=t3.large
                          role=karpenter
                          topology.kubernetes.io/zone=ap-southeast-1b
                          type=test
      Annotations:        node.alpha.kubernetes.io/ttl: 0
      CreationTimestamp:  Fri, 22 Apr 2022 15:37:34 +0000
      Taints:             node.kubernetes.io/unreachable:NoExecute
                          dedicated=karpenter:NoSchedule
                          karpenter.sh/not-ready:NoSchedule
                          node.kubernetes.io/unreachable:NoSchedule
      Unschedulable:      false
      Lease:              Failed to get lease: leases.coordination.k8s.io "ip-10-10-11-198.ap-southeast-1.compute.internal" not found
      Conditions:
        Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason                   Message
        ----             ------    -----------------                 ------------------                ------                   -------
        Ready            Unknown   Fri, 22 Apr 2022 15:37:34 +0000   Fri, 22 Apr 2022 15:38:34 +0000   NodeStatusNeverUpdated   Kubelet never posted node status.
        MemoryPressure   Unknown   Fri, 22 Apr 2022 15:37:34 +0000   Fri, 22 Apr 2022 15:38:34 +0000   NodeStatusNeverUpdated   Kubelet never posted node status.
        DiskPressure     Unknown   Fri, 22 Apr 2022 15:37:34 +0000   Fri, 22 Apr 2022 15:38:34 +0000   NodeStatusNeverUpdated   Kubelet never posted node status.
        PIDPressure      Unknown   Fri, 22 Apr 2022 15:37:34 +0000   Fri, 22 Apr 2022 15:38:34 +0000   NodeStatusNeverUpdated   Kubelet never posted node status.
    
  • Now we get into the node via SSM or SSH to check the kubelet logs, grab the kubelet bootstrap command, and run it directly inside the node:

      # /etc/eks/bootstrap.sh dev-d1 --apiserver-endpoint https://111111111111111111111.yl4.ap-southeast-1.eks.amazonaws.com --b64-cluster-ca xxx= --container-runtime containerd --kubelet-extra-args '--node-labels=role=karpenter,type=test,karpenter.sh/provisioner-name=karpenter-test,lifecycle=spot,karpenter.sh/capacity-type=spot --register-with-taints=dedicated=karpenter:NoSchedule'
      sed: can't read /etc/eks/containerd/containerd-config.toml: No such file or directory
      Exited with error on line 469
    
  • Ooh, we're missing the containerd configuration file containerd-config.toml. Check the status of sandbox-image.service and we see it never finishes starting:

      # systemctl status sandbox-image -l
      โ— sandbox-image.service - pull sandbox image defined in containerd config.toml
        Loaded: loaded (/etc/systemd/system/sandbox-image.service; enabled; vendor preset: disabled)
        Active: activating (start) since Sat 2022-04-23 08:52:45 UTC; 12min ago
      Main PID: 2965 (bash)
          Tasks: 2
        Memory: 17.0M
        CGroup: /system.slice/sandbox-image.service
                ├─2965 bash /etc/eks/containerd/pull-sandbox-image.sh
                └─3691 sleep 170
    
      Apr 23 09:05:42 ip-10-10-11-162.vc.p pull-sandbox-image.sh[2965]: github.com/containerd/containerd/vendor/github.com/urfave/cli.(*App).RunAsSubcommand(0xc00032c540, 0xc0000ecf20, 0x0, 0x0)
      Apr 23 09:05:42 ip-10-10-11-162.vc.p pull-sandbox-image.sh[2965]: /builddir/build/BUILD/containerd-1.4.13-2.amzn2.0.1/src/github.com/containerd/containerd/vendor/github.com/urfave/cli/app.go:404 +0x8f4
      Apr 23 09:05:42 ip-10-10-11-162.vc.p pull-sandbox-image.sh[2965]: main.main()
      Apr 23 09:05:42 ip-10-10-11-162.vc.p pull-sandbox-image.sh[2965]: github.com/containerd/containerd/cmd/ctr/main.go:37 +0x125
    
  • Try running the script with tracing to debug, and we see it gets stuck at the AWS ECR login:

      # bash -x /etc/eks/containerd/pull-sandbox-image.sh
      ++ awk '-F[ ="]+' '$1 == "sandbox_image" { print $2 }' /etc/containerd/config.toml
      + sandbox_image=123456789012.dkr.ecr.ap-southeast-1.amazonaws.com/eks/pause:3.1-eksbuild.1
      ++ echo 123456789012.dkr.ecr.ap-southeast-1.amazonaws.com/eks/pause:3.1-eksbuild.1
      ++ cut -f4 -d .
      + region=ap-southeast-1
      ++ aws ecr get-login-password --region ap-southeast-1
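
  • We can confirm this directly from the node (a sketch; the region comes from the trace above, and api.ecr.<region>.amazonaws.com is the standard ECR API endpoint pattern):

      # If this command hangs, the node cannot reach the Amazon ECR API
      aws ecr get-login-password --region ap-southeast-1
      # A quick connectivity probe with a 5-second timeout
      curl -sS -m 5 https://api.ecr.ap-southeast-1.amazonaws.com || echo "cannot reach the ECR API endpoint"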
    
  • Now we are close to the root cause.

🚀 Check AWS service endpoints

  • We know that our node is getting stuck at the AWS ECR login. For security, the Amazon ECR repositories that hold production images are often placed in a private network, so we need to verify that our worker node can reach the API endpoints for Amazon ECR (and also Amazon EC2 and S3 if those service endpoints are not publicly reachable).

  • Check the service endpoints (go to VPC -> Endpoints) and make sure traffic from our worker node to the VPC endpoints is allowed.
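
  • For example, with the AWS CLI (the VPC ID is a placeholder):

      # List the endpoints in the VPC. For private image pulls we need
      # com.amazonaws.<region>.ecr.api, com.amazonaws.<region>.ecr.dkr
      # and the com.amazonaws.<region>.s3 gateway endpoint
      aws ec2 describe-vpc-endpoints \
          --filters Name=vpc-id,Values=vpc-0123456789abcdef0 \
          --query 'VpcEndpoints[].[ServiceName,State]' --output table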

Conclusion

  • There are many possible causes behind worker nodes failing to join an EKS/Kubernetes cluster, so this blog just shows you the approach and some clues for troubleshooting.
