Restarting an EKS (Elastic Kubernetes Service) cluster primarily refers to managing the lifecycle of its worker nodes, as the EKS control plane is a fully managed service provided by AWS and does not require user-initiated restarts. While you cannot "restart" the entire EKS control plane, you can effectively restart or replace individual worker nodes or specific applications running within your cluster to apply updates, resolve issues, or perform maintenance.
Understanding EKS Cluster "Restart"
When users refer to restarting an EKS cluster, they typically mean one of the following:
- Restarting Worker Nodes: This is the most common scenario, involving rebooting or replacing the EC2 instances that function as Kubernetes worker nodes. This is crucial for applying operating system updates, kernel patches, or resolving node-specific issues.
- Restarting Applications/Pods: This involves terminating and recreating specific pods or deployments within the cluster, often used to apply application updates or clear transient issues.
- Cluster Upgrades: While not a "restart," upgrading the Kubernetes version of your EKS cluster involves a managed process that can lead to node replacements and control plane updates handled by AWS.
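For context, a control plane version upgrade is typically kicked off with a single command and then carried out by AWS behind the scenes. The following is a minimal sketch using placeholder names; adapt it to your cluster and target version:
# Sketch: upgrade the EKS control plane one minor version (cluster name and version are placeholders)
eksctl upgrade cluster --name <cluster-name> --approve
# Equivalent with the AWS CLI
aws eks update-cluster-version --name <cluster-name> --kubernetes-version <version>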
Step-by-Step Guide to Restarting an EKS Worker Node
Restarting a worker node is a critical operation that needs to be performed carefully to avoid service disruption. This process ensures that running applications are gracefully moved to other available nodes before the target node is taken offline.
Why Restart a Worker Node?
Worker nodes may need to be restarted for various reasons, including:
- Operating System Updates: Applying security patches or system upgrades to the underlying EC2 instances.
- Kernel Updates: Essential for security and performance improvements.
- Troubleshooting: Resolving persistent issues with a specific node, such as resource exhaustion or network problems.
- Configuration Changes: Applying changes that require a node reboot.
Prerequisites
Before you begin, ensure you have:
- kubectl configured: To interact with your Kubernetes cluster.
- AWS CLI configured: To manage your EC2 instances.
- Sufficient available capacity: Your cluster must have enough healthy nodes to temporarily host the workloads from the node being restarted without impacting performance or availability.
- Application readiness: Ensure your applications are designed for high availability and can tolerate node failures (e.g., using Deployments with multiple replicas).
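As a quick sanity check before starting, you can confirm both tools are pointed at the right cluster and AWS account (standard kubectl and AWS CLI calls):
# Confirm kubectl can reach the cluster and list its worker nodes
kubectl get nodes -o wide
# Confirm the AWS CLI is authenticated against the expected account
aws sts get-caller-identity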
The Node Restart Process
Here's a detailed breakdown of how to gracefully restart an EKS worker node, minimizing impact on your running applications:
1. Cordon the Node:
- Purpose: This step prevents Kubernetes from scheduling any new pods onto the node you intend to restart. Existing pods will continue to run.
- Command Example:
kubectl cordon <node-name>
- Verification:
kubectl get nodes -o wide | grep <node-name>
Look for SchedulingDisabled in the STATUS column.
2. Drain the Node:
- Purpose: After cordoning, you need to evict all existing pods from the node. Kubernetes will reschedule these pods onto other available nodes in your cluster. This step is crucial for minimizing downtime for your applications.
- Command Example:
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --force
- --ignore-daemonsets: DaemonSet pods run one per node and cannot be evicted to other nodes; this flag lets the drain proceed, and they resume on the node after it rejoins the cluster.
- --delete-emptydir-data: Acknowledges that data in emptyDir volumes will be lost.
- --force: Required for some pods, especially those not managed by a controller (e.g., bare pods), or if you need to drain quickly. Use with caution.
- Verification:
kubectl get pods -o wide --all-namespaces | grep <node-name>
You should see no application pods running on the drained node.
3. Shut Down/Restart the Node (Provider-Specific):
- Purpose: Once the node is cordoned and drained, you can safely shut down or restart the underlying EC2 instance.
- Method:
- Reboot: A simple restart of the instance.
aws ec2 reboot-instances --instance-ids <instance-id>
- Stop/Start: Stops the instance completely and then starts it. This can be necessary for certain instance type changes or deeper-level issues. Note that a stop/start will change the public IP address if one is assigned (unless using an Elastic IP).
aws ec2 stop-instances --instance-ids <instance-id>
aws ec2 start-instances --instance-ids <instance-id>
- Replace Node: For critical updates or persistent issues, it's often safer to terminate the node and let your Auto Scaling Group (ASG) provision a new one. This ensures a clean slate.
aws ec2 terminate-instances --instance-ids <instance-id>
(Ensure your ASG desired capacity is appropriately set).
- Identify Instance ID: You can find the EC2 instance ID associated with your Kubernetes node name by running kubectl describe node <node-name> and looking at the ProviderID field, as in the sketch below.
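For example, a minimal sketch (node name is a placeholder) that extracts the instance ID from the ProviderID, which for EKS nodes has the form aws:///<availability-zone>/<instance-id>, and then waits for the rebooted node to report Ready:
# Pull the EC2 instance ID out of the node's ProviderID (aws:///<az>/<instance-id>)
INSTANCE_ID=$(kubectl get node <node-name> -o jsonpath='{.spec.providerID}' | awk -F/ '{print $NF}')
echo "$INSTANCE_ID"
# After the reboot, wait for the node to report Ready before uncordoning
kubectl wait --for=condition=Ready node/<node-name> --timeout=5m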
4. Uncordon the Node:
- Purpose: After the node has successfully restarted and rejoined the cluster (and all necessary services like kubelet are running), uncordon it to allow Kubernetes to schedule new pods onto it.
- Command Example:
kubectl uncordon <node-name>
- Verification:
kubectl get nodes -o wide | grep <node-name>
The STATUS should revert to Ready.
5. Verify Node and Workload Health:
- Purpose: Confirm that the node is healthy and that workloads are being scheduled and running as expected.
- Verification:
kubectl get nodes
kubectl get pods --all-namespaces -o wide | grep <node-name>
kubectl get deployments --all-namespaces
Check logs and application endpoints to ensure everything is functioning correctly.
Restarting Nodes in Managed vs. Self-Managed Node Groups
The approach to node restarts can vary slightly depending on whether you use EKS Managed Node Groups or self-managed node groups.
Managed Node Groups
- AWS manages the lifecycle of the EC2 instances, including automatic updates for AMI (Amazon Machine Image) and Kubernetes versions.
- When a new AMI or Kubernetes version is available, you can initiate an update. AWS performs a rolling update, cordoning and draining nodes one by one, replacing them with new instances running the updated configuration. This minimizes disruption.
- For a manual "restart" (e.g., applying custom changes or troubleshooting), you can still use the kubectl cordon and kubectl drain steps, then update the launch template associated with the managed node group (if applicable) and initiate a node group update via the EKS console, AWS CLI, or eksctl (example commands below). This triggers a rolling replacement.
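For example, a rolling node group update can be started from the command line; a minimal sketch with placeholder cluster and node group names:
# Roll the managed node group onto the latest AMI release for its Kubernetes version
aws eks update-nodegroup-version --cluster-name <cluster-name> --nodegroup-name <nodegroup-name>
# Equivalent with eksctl
eksctl upgrade nodegroup --cluster <cluster-name> --name <nodegroup-name>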
Self-Managed Node Groups
- You are responsible for creating, managing, and updating the EC2 instances.
- The manual cordon, drain, and restart/replace steps outlined above are directly applicable.
- Automation tools like AWS Systems Manager, Ansible, or custom scripts are often used to manage rolling restarts across multiple nodes.
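As an illustration, a rolling reboot across several self-managed nodes might be scripted roughly as follows. This is a minimal sketch with placeholder node names; a production script would add error handling and watch node conditions rather than sleeping for a fixed interval:
#!/usr/bin/env bash
# Sketch: gracefully reboot self-managed nodes one at a time (replace node names with your own)
set -euo pipefail
for NODE in node-1 node-2 node-3; do
  kubectl cordon "$NODE"
  kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data
  INSTANCE_ID=$(kubectl get node "$NODE" -o jsonpath='{.spec.providerID}' | awk -F/ '{print $NF}')
  aws ec2 reboot-instances --instance-ids "$INSTANCE_ID"
  sleep 120  # crude pause while the instance reboots; a real script would watch for NotReady first
  kubectl wait --for=condition=Ready "node/$NODE" --timeout=10m
  kubectl uncordon "$NODE"
done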
Restarting Applications or Pods
If the goal is to restart a specific application or its pods without affecting the entire node, Kubernetes provides mechanisms for this.
- Deployment Rollout Restart:
- Purpose: Force a rolling restart of all pods managed by a Deployment, effectively creating new pods with the latest configuration or clearing transient issues.
- Command Example:
kubectl rollout restart deployment/<deployment-name> -n <namespace>
- Verification:
kubectl rollout status deployment/<deployment-name> -n <namespace>
You can also watch the pod names change:
kubectl get pods -n <namespace> -w
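Similarly, deleting a single pod lets its controller recreate it, and the same rolling restart works for other workload types (names below are placeholders):
# Delete one pod; its controller (Deployment, StatefulSet, etc.) recreates it automatically
kubectl delete pod <pod-name> -n <namespace>
# Rolling restarts also work for StatefulSets and DaemonSets
kubectl rollout restart statefulset/<statefulset-name> -n <namespace>
kubectl rollout restart daemonset/<daemonset-name> -n <namespace>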
Best Practices for Node Maintenance
- Automate: For production environments, automate node maintenance (e.g., using CI/CD pipelines, AWS Systems Manager, or eksctl upgrade commands for node groups) to reduce manual effort and human error.
- Monitor: Continuously monitor node health, pod status, and application performance during and after any restart operation.
- Test in Non-Production: Always test your restart procedures in a staging or development environment before applying them to production.
- High Availability: Design your applications with multiple replicas spread across different nodes and availability zones to tolerate individual node failures or restarts.
- Pod Disruption Budgets (PDBs): Implement PDBs to ensure that a minimum number of replicas for an application remain available during voluntary disruptions like node drains.
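For illustration, a minimal PodDisruptionBudget sketch that keeps at least two replicas of a hypothetical workload labeled app=my-app available during node drains (the name and label are placeholders):
# Sketch: require at least 2 pods with label app=my-app to stay available during voluntary disruptions
kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
EOF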
By following these guidelines, you can effectively manage restarts and maintenance for your EKS worker nodes and applications while maintaining cluster stability and application availability.