DEV Community

Pepe

Troubleshooting Self Managed Node Groups in Terraform EKS

When you deploy an EKS cluster with self-managed node groups using the Terraform AWS EKS module, the nodes sometimes are never created. In the case described here, the root cause was that Terraform looked up the AMI in the wrong catalog: the Amazon Linux 2 SSM parameter path, even though the custom AMI was based on Amazon Linux 2023. The fix was straightforward: set the ami_type parameter explicitly.

Overview of the Issue

The Terraform configuration was meant to set up an EKS cluster with self-managed node groups. Although the VPC, subnets, and IAM roles were configured correctly, the node groups were never created. The investigation showed that Terraform was resolving the AMI from a catalog that did not match the custom image.

Diagnosing the Problem

I was trying to build cluster nodes from an AMI based on Amazon Linux 2023. However, the AWS Console showed this error message:

Launch template nodegroup-launch-template-123456 should not specify an instance profile. The noderole in your request will be used to construct an instance profile.

The terraform plan output also looked a little strange:

Acquiring state lock. This may take a few moments...
module.eks.data.aws_partition.current[0]: Reading...
module.eks.module.kms.data.aws_partition.current[0]: Reading...
module.eks.data.aws_caller_identity.current[0]: Reading...
module.eks.data.aws_iam_policy_document.assume_role_policy[0]: Reading...
data.aws_caller_identity.current: Reading...
module.eks.module.kms.data.aws_caller_identity.current[0]: Reading...
module.eks.module.kms.data.aws_partition.current[0]: Read complete after 0s [id=aws]
module.eks.data.aws_partition.current[0]: Read complete after 0s [id=aws]
module.eks.data.aws_iam_policy_document.assume_role_policy[0]: Read complete after 0s [id=xxxxxx]
module.eks.data.aws_caller_identity.current[0]: Read complete after 0s [id=xxxxxx]
module.eks.data.aws_iam_policy_document.custom[0]: Reading...
module.eks.data.aws_iam_session_context.current[0]: Reading...
data.aws_caller_identity.current: Read complete after 0s [id=xxxxxx]
module.eks.data.aws_iam_policy_document.custom[0]: Read complete after 0s [id=xxxxxx]
module.eks.module.kms.data.aws_caller_identity.current[0]: Read complete after 0s [id=xxxxxx]
module.eks.module.self_managed_node_group["custom_nodegroup "].data.aws_partition.current: Reading...
module.eks.module.self_managed_node_group["custom_nodegroup "].data.aws_partition.current: Read complete after 0s [id=aws]
module.eks.module.self_managed_node_group["custom_nodegroup "].data.aws_caller_identity.current: Reading...
module.eks.module.self_managed_node_group["custom_nodegroup "].data.aws_iam_policy_document.assume_role_policy[0]: Reading...
module.eks.module.self_managed_node_group["custom_nodegroup "].data.aws_caller_identity.current: Read complete after 0s [id=xxxxxx]
module.eks.module.self_managed_node_group["custom_nodegroup "].data.aws_iam_policy_document.assume_role_policy[0]: Read complete after 0s [id=xxxxxx]
module.eks.data.aws_iam_session_context.current[0]: Read complete after 0s [id=arn:aws:sts::xxxxxx:assumed-role/eks-user/eks-1234-123456]
module.eks.module.self_managed_node_group["custom_nodegroup "].data.aws_ssm_parameter.ami[0]: Reading...
module.eks.module.self_managed_node_group["custom_nodegroup "].data.aws_ssm_parameter.ami[0]: Read complete after 0s [id=/aws/service/eks/optimized-ami/1.32/amazon-linux-2/recommended/image_id]
Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:

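The module was trying to locate the AMI in the wrong catalog. The SSM parameter it read points at the Amazon Linux 2 optimized AMI, not an Amazon Linux 2023 one: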

[id=/aws/service/eks/optimized-ami/1.32/amazon-linux-2/recommended/image_id]
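To compare the two catalogs directly, the lookups can be reproduced outside the EKS module. The following is a minimal sketch, assuming a configured AWS provider and Kubernetes version 1.32 as in the plan output. The data source and output names (al2, al2023, al2_ami_id, al2023_ami_id) are hypothetical, and the Amazon Linux 2023 path is the one AWS documents for the x86_64 standard variant, not something taken from the plan above.

```hcl
# Sketch only: reproduces the SSM lookups outside the EKS module.
# All resource and output names here are hypothetical.

# Path the module resolved in the plan output (Amazon Linux 2).
data "aws_ssm_parameter" "al2" {
  name = "/aws/service/eks/optimized-ami/1.32/amazon-linux-2/recommended/image_id"
}

# Path AWS documents for the Amazon Linux 2023 x86_64 standard variant.
data "aws_ssm_parameter" "al2023" {
  name = "/aws/service/eks/optimized-ami/1.32/amazon-linux-2023/x86_64/standard/recommended/image_id"
}

# SSM parameter values are marked sensitive by the provider,
# so they are unwrapped explicitly before being shown.
output "al2_ami_id" {
  value = nonsensitive(data.aws_ssm_parameter.al2.value)
}

output "al2023_ami_id" {
  value = nonsensitive(data.aws_ssm_parameter.al2023.value)
}
```

Running terraform plan on this fragment should show two different AMI IDs, which makes it easier to see that the path in the plan output belongs to a different OS family than the custom image.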

According to the Terraform EKS module documentation, ami_type should be set to one of the values listed in the official AWS documentation:

amiType
If the node group was deployed using a launch template with a custom AMI, then this is CUSTOM. For node groups that weren't deployed using a launch template, this is the AMI type that was specified in the node group configuration.

Type: String

Valid Values:

AL2_x86_64 | AL2_x86_64_GPU | AL2_ARM_64 | CUSTOM | BOTTLEROCKET_ARM_64 | BOTTLEROCKET_x86_64 | BOTTLEROCKET_ARM_64_NVIDIA | BOTTLEROCKET_x86_64_NVIDIA | WINDOWS_CORE_2019_x86_64 | WINDOWS_FULL_2019_x86_64 | WINDOWS_CORE_2022_x86_64 | WINDOWS_FULL_2022_x86_64 | AL2023_x86_64_STANDARD | AL2023_ARM_64_STANDARD | AL2023_x86_64_NEURON | AL2023_x86_64_NVIDIA

The documentation suggests that a custom AMI calls for the CUSTOM value, but that does not work here: Terraform does not accept CUSTOM (see GitHub issue #3094). Instead, ami_type has to be the type the custom image is based on. In my case, that is AL2023_x86_64_STANDARD.

Fixed version:

variable "ami_type" {
  description = "type of AMI to be used"
  type        = string
  default     = "AL2023_x86_64_STANDARD"
}

module "eks" {
  # source, version, and cluster-level settings are not shown in the original snippet

  self_managed_node_groups = {
    custom_nodegroup = {
      name            = local.cluster_name
      use_name_prefix = false

      subnet_ids                    = var.subnet_ids
      additional_security_group_ids = [aws_security_group.eks_sg.id]

      instance_type        = var.instance_type
      ami_type             = var.ami_type
      ami_id               = var.ami_id
      key_name             = var.key_name
      asg_min_size         = 1
      asg_desired_capacity = 1
      asg_max_size         = 2

      launch_template = {
        create_launch_template = true
        name                   = "nodegroup-launch-template"
        use_name_prefix        = true
        description            = "Self managed node group launch template"
      }

      enable_monitoring = true

      create_iam_role_policy   = true
      iam_role_name            = local.cluster_name
      iam_role_use_name_prefix = false
      iam_role_description     = "Self managed node group role"

      create_security_group          = true
      security_group_name            = local.cluster_name
      security_group_use_name_prefix = false
      revoke_rules_on_delete         = true
      create_before_destroy          = true
      security_group_description     = "Self managed node group security group"

      security_group_tags = {
        Purpose = "Protector of the kubelet"
      }
    }
  }

  tags = local.tags
}
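Since CUSTOM is rejected in this setup, one option is to catch it at the variable level, before the value reaches the module. This is a minimal sketch, not part of the original configuration: it extends the ami_type variable from the fixed version with a validation block, and the wording of the error message is my own.

```hcl
# Sketch only: the ami_type variable from the fixed version,
# extended with a guard that is not in the original configuration.
variable "ami_type" {
  description = "AMI type matching the base OS of the custom image"
  type        = string
  default     = "AL2023_x86_64_STANDARD"

  validation {
    # Reject CUSTOM early so the failure shows up at plan time.
    condition     = var.ami_type != "CUSTOM"
    error_message = "Set ami_type to the type the custom image is based on, for example AL2023_x86_64_STANDARD."
  }
}
```

With this in place, terraform plan fails with the message above if CUSTOM is passed in, instead of failing later with a less obvious error.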

Conclusion

In this case, the self-managed node groups failed to appear not because of security groups or IAM roles, but because Terraform was trying to retrieve the AMI from the wrong catalog. Setting ami_type to the type the custom image is based on pointed Terraform at the correct catalog and resolved the issue.

Keeping ami_type consistent with the base OS of the image passed in ami_id, and checking the AMI path that appears in the terraform plan output, helps you avoid this pitfall when deploying an EKS cluster with Terraform.
