NAT Gateways not recreated #16876

Open
zaneclaes opened this issue Oct 4, 2024 · 10 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@zaneclaes

zaneclaes commented Oct 4, 2024

/kind bug

1. What kops version are you running? The command kops version, will display
this information.

1.30.1

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

1.30.2

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

Manually deleted the NAT gateways and EIPs on AWS (whoops).
Tried kops update cluster but it does not detect the deletion; instead it spits out NAT gateway errors:

W1004 15:50:51.222631 35897 executor.go:141] error running task "ElasticIP/us-east" (7m29s remaining to succeed): error finding AssociatedNatGatewayRouteTable: error listing NatGateway "nat-0460de55eeb540794": operation error EC2: DescribeNatGateways, https response error StatusCode: 400, RequestID: 6af7f0d1-1b02-461f-be25-b83f6b4330c9, api error NatGatewayNotFound: The Nat gateway nat-0460de55eeb540794 was not found

5. What happened after the commands executed?

Cluster is no longer working, and no kops commands seem to fix it.

6. What did you expect to happen?

According to #6830 and #6518 this should be fixed.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2024-10-03T13:53:21Z"
  generation: 12
  name: k8s.mysite.com
spec:
  additionalPolicies:
    node: |
      [
        {
          "Effect": "Allow",
          "Action": [
            "sts:AssumeRole"
          ],
          "Resource": [
            "arn:aws:iam::AWS_ID:role/k8s-moongate"
          ]
        }
      ]
  api:
    loadBalancer:
      class: Network
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudControllerManager: {}
  cloudProvider: aws
  configBase: s3://com-state-store/k8s.mysite.com
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: control-plane-us-east-2a
      name: a
    - encryptedVolume: true
      instanceGroup: control-plane-us-east-2b
      name: b
    - encryptedVolume: true
      instanceGroup: control-plane-us-east-2c
      name: c
    manager:
      backupRetentionDays: 90
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: control-plane-us-east-2a
      name: a
    - encryptedVolume: true
      instanceGroup: control-plane-us-east-2b
      name: b
    - encryptedVolume: true
      instanceGroup: control-plane-us-east-2c
      name: c
    manager:
      backupRetentionDays: 90
    memoryRequest: 100Mi
    name: events
  externalPolicies:
    node:
    - arn:aws:iam::AWS_ID:policy/AWSLoadBalancerControllerIAMPolicy2
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeProxy:
    enabled: false
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  - ::/0
  kubernetesVersion: 1.30.2
  masterPublicName: api.k8s.mysite.com
  networkCIDR: 172.20.0.0/16
  networking:
    cilium:
      enableNodePort: true
  nonMasqueradeCIDR: ::/0
  sshAccess:
  - 0.0.0.0/0
  - ::/0
  subnets:
  - cidr: 172.20.0.0/18
    ipv6CIDR: /64#0
    name: us-east-2a
    type: Public
    zone: us-east-2a
  - cidr: 172.20.64.0/18
    ipv6CIDR: /64#1
    name: us-east-2b
    type: Public
    zone: us-east-2b
  - cidr: 172.20.128.0/18
    ipv6CIDR: /64#2
    name: us-east-2c
    type: Public
    zone: us-east-2c
  topology:
    dns:
      type: None

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2024-10-03T13:53:22Z"
  labels:
    kops.k8s.io/cluster: k8s.mysite.com
  name: control-plane-us-east-2a
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20240607
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - us-east-2a

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2024-10-03T13:53:22Z"
  labels:
    kops.k8s.io/cluster: k8s.mysite.com
  name: control-plane-us-east-2b
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20240607
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - us-east-2b

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2024-10-03T13:53:22Z"
  labels:
    kops.k8s.io/cluster: k8s.mysite.com
  name: control-plane-us-east-2c
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20240607
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - us-east-2c

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2024-10-03T13:53:22Z"
  generation: 5
  labels:
    kops.k8s.io/cluster: k8s.mysite.com
  name: nodes-us-east-2a
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20240607
  instanceMetadata:
    httpPutResponseHopLimit: 2
    httpTokens: required
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  role: Node
  subnets:
  - us-east-2a

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2024-10-03T13:53:22Z"
  generation: 5
  labels:
    kops.k8s.io/cluster: k8s.mysite.com
  name: nodes-us-east-2b
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20240607
  instanceMetadata:
    httpPutResponseHopLimit: 2
    httpTokens: required
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  role: Node
  subnets:
  - us-east-2b

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2024-10-03T13:53:23Z"
  generation: 5
  labels:
    kops.k8s.io/cluster: k8s.mysite.com
  name: nodes-us-east-2c
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20240607
  instanceMetadata:
    httpPutResponseHopLimit: 2
    httpTokens: required
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  role: Node
  subnets:
  - us-east-2c

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else do we need to know?

Fat-fingered NAT deletion... but I really don't want to rebuild the whole cluster 😢 🙏

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Oct 4, 2024
@hakman
Member

hakman commented Oct 5, 2024

@zaneclaes It seems that some resources (EIPs) still reference the deleted NATGW. Could you delete those as well and retry?

@zaneclaes
Author

zaneclaes commented Oct 5, 2024 via email

@hakman
Member

hakman commented Oct 5, 2024

Something still references "nat-0460de55eeb540794" according to the logs (possibly a route table), but no such references should exist anymore.
See the error message that you pasted. Nothing in the kOps config references the NATGW ID.

@zaneclaes
Author

zaneclaes commented Oct 5, 2024 via email

@hakman
Member

hakman commented Oct 5, 2024

Yes, but kOps does not keep track of the resources. The NGW ID you see there comes from some other resource that used to reference it.
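
One way to hunt for the "other resource" described above is to dump the relevant EC2 describe responses and search them for the stale ID. The following is a minimal sketch, not part of kOps: `find_references` is a hypothetical helper that walks any JSON-like structure (for example, a parsed describe-route-tables response) and reports every place the ID appears. The sample data, including the `rtb-example` ID and the `example-tag` key, is made up for illustration.

```python
from typing import Any, Iterator


def find_references(data: Any, needle: str, path: str = "$") -> Iterator[str]:
    """Yield a path for every string value in `data` that contains `needle`."""
    if isinstance(data, dict):
        for key, value in data.items():
            yield from find_references(value, needle, f"{path}.{key}")
    elif isinstance(data, list):
        for index, value in enumerate(data):
            yield from find_references(value, needle, f"{path}[{index}]")
    elif isinstance(data, str) and needle in data:
        yield path


# Illustrative sample only; a real response would come from the AWS API.
sample = {
    "RouteTables": [
        {
            "RouteTableId": "rtb-example",
            "Routes": [
                {
                    "DestinationCidrBlock": "0.0.0.0/0",
                    "NatGatewayId": "nat-0460de55eeb540794",
                    "State": "blackhole",
                }
            ],
            "Tags": [{"Key": "example-tag", "Value": "nat-0460de55eeb540794"}],
        }
    ]
}

hits = list(find_references(sample, "nat-0460de55eeb540794"))
for hit in hits:
    print(hit)
```

Because the search is structure-agnostic, the same helper can be pointed at responses for subnets, addresses, or any other resource type without knowing in advance which field or tag holds the ID.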

@rifelpet
Member

rifelpet commented Oct 5, 2024

According to the kOps source code, one of the AWS route tables for your cluster contains a route to the deleted NGW ID. The route likely has a state of blackhole because the NGW no longer exists. Deleting that route should fix the problem.
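
The check suggested above can be sketched in Python. This is an illustration under assumptions rather than anything from kOps: `blackhole_nat_routes` is a hypothetical helper, and it expects a dict shaped like the EC2 DescribeRouteTables response (`RouteTables`, each with `Routes` carrying `NatGatewayId` and `State`). The sample response and the `rtb-example` ID are invented.

```python
from typing import Dict, List, Tuple


def blackhole_nat_routes(response: Dict, nat_gateway_id: str) -> List[Tuple[str, str]]:
    """Return (route table ID, destination) pairs for blackhole routes
    that still point at the given NAT gateway."""
    found = []
    for table in response.get("RouteTables", []):
        for route in table.get("Routes", []):
            if route.get("NatGatewayId") != nat_gateway_id:
                continue
            if route.get("State") != "blackhole":
                continue
            # A route has either an IPv4 or an IPv6 destination.
            destination = route.get("DestinationCidrBlock") or route.get(
                "DestinationIpv6CidrBlock", ""
            )
            found.append((table["RouteTableId"], destination))
    return found


# Invented sample data standing in for a real DescribeRouteTables response.
response = {
    "RouteTables": [
        {
            "RouteTableId": "rtb-example",
            "Routes": [
                {"DestinationCidrBlock": "172.20.0.0/16", "GatewayId": "local", "State": "active"},
                {
                    "DestinationCidrBlock": "0.0.0.0/0",
                    "NatGatewayId": "nat-0460de55eeb540794",
                    "State": "blackhole",
                },
            ],
        }
    ]
}

stale = blackhole_nat_routes(response, "nat-0460de55eeb540794")
print(stale)
```

Each returned pair identifies one route to delete: the route table it lives in and the destination that selects it.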

@zaneclaes
Author

Thanks for the clarifications; that's very helpful. I've cleared out all the route tables in the account (by first removing them from the associated subnets) except the default route table for the cluster (which cannot be deleted, as it is the default for the VPC). However, kops update still gives the same error, leading me to assume there's something else referencing that NAT gateway somewhere in the AWS account (and that the solution is not to somehow delete the default route table)...

@rifelpet
Member

rifelpet commented Oct 5, 2024

Can you find any Elastic IPs for the cluster? They'll be tagged with your cluster name, but I'm not sure if they'll have an association, given that their NGW was deleted. Deleting those EIPs may help.
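
A rough sketch of that lookup, assuming a dict shaped like the EC2 DescribeAddresses response (`Addresses`, each with `AllocationId`, optional `AssociationId`, and `Tags`). The helper name `cluster_addresses` is hypothetical, and the tag key `KubernetesCluster` is an assumption about how the cluster's EIPs are tagged; adjust it to whatever tag actually carries the cluster name. The sample data is invented.

```python
from typing import Dict, List, Tuple


def cluster_addresses(
    response: Dict, cluster_name: str, tag_key: str = "KubernetesCluster"
) -> List[Tuple[str, bool]]:
    """Return (allocation ID, is_associated) for each Elastic IP whose
    `tag_key` tag equals `cluster_name`. The default tag key is an assumption."""
    result = []
    for address in response.get("Addresses", []):
        tags = {tag["Key"]: tag["Value"] for tag in address.get("Tags", [])}
        if tags.get(tag_key) == cluster_name:
            # An address without an AssociationId is not attached to anything.
            result.append((address.get("AllocationId", ""), "AssociationId" in address))
    return result


# Invented sample data standing in for a real DescribeAddresses response.
addresses = {
    "Addresses": [
        {
            "AllocationId": "eipalloc-example1",
            "Tags": [{"Key": "KubernetesCluster", "Value": "k8s.mysite.com"}],
        },
        {
            "AllocationId": "eipalloc-example2",
            "AssociationId": "eipassoc-example",
            "Tags": [{"Key": "KubernetesCluster", "Value": "other.cluster"}],
        },
    ]
}

orphans = cluster_addresses(addresses, "k8s.mysite.com")
print(orphans)
```

An entry whose second element is `False` is a cluster-tagged address with no association, which is the kind of leftover EIP the comment above is asking about.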

@zaneclaes
Author

zaneclaes commented Oct 6, 2024

@rifelpet every time I delete all the Elastic IPs and then run kops update, it recreates the EIPs but then shows the same exact NAT gateway error, without creating any new NAT gateway in the process.

Just to be clear:

  • The only route table in my account is the default for the VPC
  • There are no Elastic IPs at all
  • There are no NAT gateways at all

When I run a kops update, the route tables and EIPs are recreated, but then the same NAT gateway error appears:

W1006 06:59:55.210178   57754 executor.go:141] error running task "NatGateway/us-east-" (9m13s remaining to succeed): error listing NatGateway "nat-0460de55eeb540794": operation error EC2: DescribeNatGateways, https response error StatusCode: 400, RequestID: 2495c451-8c60-42a7-81a3-e34dddcad6e8, api error NatGatewayNotFound: The Nat gateway nat-0460de55eeb540794 was not found

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 4, 2025
5 participants