Understanding Node Problem Detector in Kubernetes: Beyond Default Node Conditions

#kubernetes #cloudnative #linux #cloudcomputing

Introduction

In Kubernetes, monitoring node health is crucial for maintaining a reliable cluster. While Kubernetes provides built-in node conditions, these basic health checks might not be sufficient for production environments. This is where Node Problem Detector (NPD) comes in, extending the default monitoring capabilities with rich system-level problem detection.

This article covers the features and benefits of NPD, showing how it goes beyond the default Kubernetes node health monitoring to proactively detect and surface potential node issues.

Default Kubernetes Node Conditions

By default, Kubernetes nodes come with several built-in conditions that provide basic health information about the nodes in the cluster.

These conditions are:

Ready: Is the node healthy and able to accept pods?
MemoryPressure: Is the node running low on memory?
DiskPressure: Is the node running low on disk capacity?
PIDPressure: Is the node overloaded with too many processes?
NetworkUnavailable: Are network configurations causing connectivity issues?

Each condition has a status that describes the current health or operational state of the node. There are three possible values:

True: The condition currently applies. For instance, if MemoryPressure is True, the node is experiencing memory pressure at the moment. Note that Ready is the exception in polarity: True is its healthy state.
False: The condition does not apply. For example, if DiskPressure is False, the node has sufficient disk capacity.
Unknown: The system cannot determine the status of the condition, often due to a lack of communication or incomplete data from the node.

These conditions can be viewed using the command:

kubectl describe node <node-name>

Among other details, the output includes a Conditions section that lists each node condition along with its status:

Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Mon, 13 Jan 2025 21:19:43 +0100   Sun, 01 Dec 2024 01:03:13 +0100   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Mon, 13 Jan 2025 21:19:43 +0100   Sun, 01 Dec 2024 01:03:13 +0100   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Mon, 13 Jan 2025 21:19:43 +0100   Sun, 01 Dec 2024 01:03:13 +0100   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Mon, 13 Jan 2025 21:19:43 +0100   Sun, 01 Dec 2024 01:03:33 +0100   KubeletReady                 kubelet is posting ready status
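
For scripting, the same conditions can be read as tab-separated pairs and filtered for anything abnormal. The following is a minimal sketch rather than an established tool: the function names node_conditions and flag_abnormal_conditions are hypothetical, and the filter assumes that Ready is healthy when True while every other condition is healthy when False.

```shell
#!/bin/bash
# Print "type<TAB>status" for every condition of the given node.
# Argument: node name.
node_conditions() {
  kubectl get node "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\n"}{end}'
}

# Read "type<TAB>status" lines on stdin and print the abnormal ones.
# Ready is healthy when True; any other condition is healthy when False.
flag_abnormal_conditions() {
  awk -F'\t' '
    NF < 2 { next }
    $1 == "Ready" { if ($2 != "True") print $1 "=" $2; next }
    $2 != "False" { print $1 "=" $2 }
  '
}
```

Running node_conditions <node-name> | flag_abnormal_conditions prints nothing for a healthy node and one line per abnormal condition otherwise.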

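Based on these statuses, Kubernetes adds taints that match the condition affecting the node (for example, node.kubernetes.io/memory-pressure when MemoryPressure is True), so the scheduler avoids placing new pods there unless they tolerate the taint. While these default conditions offer a quick view of a node’s health, they can miss deeper, system-level issues such as a kernel deadlock or a filesystem that has gone read-only. This is where Node Problem Detector steps in to fill the gap.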

Node Problem Detector: Enhanced Node Monitoring

Node Problem Detector extends Kubernetes' native node monitoring capabilities by detecting and reporting various system-level issues. It runs as a daemon on the node, detects node problems, and reports them to the apiserver.

How Node Problem Detector Works

Problem daemons are the core components that monitor and detect node problems. Each one watches for a specific class of problem and reports what it finds to Node Problem Detector. NPD supports several types of problem daemons:

SystemLogMonitor: Watches system logs (journald, syslog, etc.) for predefined patterns and reports problems and metrics accordingly. The node conditions reported by this daemon are:
- KernelDeadlock
- ReadonlyFilesystem
- FrequentDockerRestart
- FrequentKubeletRestart
- FrequentContainerdRestart
CustomPluginMonitor: Executes custom scripts for specific problem detection.
HealthChecker: Performs periodic health checks. The types of node conditions reported by this daemon are KubeletUnhealthy and ContainerRuntimeUnhealthy.

Upon detection of problems, NPD makes the problem visible to the Kubernetes management stack through the apiserver. Problems are reported as NodeCondition (if it is a permanent problem that will make the node unavailable for pod scheduling) or Event (if it is a temporary problem that has limited impact).
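
To see what NPD has reported for a node, both channels can be queried with kubectl. This is a hedged sketch: node_event_selector and show_npd_reports are hypothetical helper names, and KernelDeadlock is used only as an example condition type taken from the SystemLogMonitor list.

```shell
#!/bin/bash
# Build a field selector matching events whose involved object is the given node.
node_event_selector() {
  printf 'involvedObject.kind=Node,involvedObject.name=%s' "$1"
}

# Show both kinds of report for one node.
# Argument: node name.
show_npd_reports() {
  # Temporary problems: events recorded against the node.
  kubectl get events --all-namespaces \
    --field-selector "$(node_event_selector "$1")"

  # Permanent problems: status of one condition (example type only).
  kubectl get node "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="KernelDeadlock")].status}{"\n"}'
}
```

Calling show_npd_reports <node-name> lists the node's events first and then prints True, False, or Unknown for the chosen condition.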

Deploying Node Problem Detector

Method 1: Using Helm

NPD can be deployed as a DaemonSet using the community Helm chart maintained by Delivery Hero:

helm repo add deliveryhero https://charts.deliveryhero.io/
helm install node-problem-detector deliveryhero/node-problem-detector \
  --namespace kube-system
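
After installing, it is worth confirming that an NPD pod is ready on every node. The sketch below assumes the chart creates a DaemonSet named node-problem-detector in kube-system (inferred from the release name used here, not verified against the chart); daemonset_fully_ready and check_npd_daemonset are hypothetical helper names.

```shell
#!/bin/bash
# Succeed when at least one pod is desired and every desired pod is ready.
# Arguments: desired pod count, ready pod count.
daemonset_fully_ready() {
  [ "${1:-0}" -gt 0 ] && [ "${1:-0}" -eq "${2:-0}" ]
}

# Fetch "desired ready" for the NPD DaemonSet and evaluate the counts.
# The DaemonSet name is an assumption based on the Helm release name.
check_npd_daemonset() {
  counts=$(kubectl -n kube-system get daemonset node-problem-detector \
    -o jsonpath='{.status.desiredNumberScheduled} {.status.numberReady}') || return 1
  # Intentionally unquoted so the two numbers become two arguments.
  daemonset_fully_ready $counts
}
```

check_npd_daemonset returns a zero exit status only when the ready count matches the desired count, so it can gate a deployment script.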

Method 2: As a System Service

For environments without DaemonSet support, NPD can run as a system service. To achieve this:

Download the Node Problem Detector binaries.
Create a systemd service file.
Enable the service using systemd commands.
Start the Node Problem Detector service.
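
On a systemd host, the steps above might look like the following. This is a sketch under stated assumptions: the binary path /usr/local/bin/node-problem-detector, the config path /etc/node-problem-detector/kernel-monitor.json, and the helper name write_npd_unit are illustrative choices, not values prescribed by the NPD project. A standalone install may also need API server connection settings, which are omitted here.

```shell
#!/bin/bash
# Write a systemd unit for NPD into the given directory.
# Argument: target directory (/etc/systemd/system on a real host).
write_npd_unit() {
  unit_dir=$1
  cat > "$unit_dir/node-problem-detector.service" <<'EOF'
[Unit]
Description=Node Problem Detector
After=network-online.target
Wants=network-online.target

[Service]
# Binary and config locations are assumptions; adjust to your install.
ExecStart=/usr/local/bin/node-problem-detector \
  --config.system-log-monitor=/etc/node-problem-detector/kernel-monitor.json
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF
}
```

On a real node, run write_npd_unit /etc/systemd/system as root, then systemctl daemon-reload followed by systemctl enable --now node-problem-detector.service, which covers both the enable and start steps.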

Customizing Node Problem Detector

One of NPD’s standout features is its ability to adapt to your specific needs. By leveraging the CustomPluginMonitor problem daemon, you can define custom node conditions and rules to monitor exactly what matters most to your workloads.

1. Adding Custom Conditions and Detection Rules

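This example shows a CustomPluginMonitor configuration file. It declares a custom condition, NTPProblem, and two rules that run the same check script: a temporary rule that emits an event when the check fails, and a permanent rule that sets the NTPProblem condition.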

{
  "plugin": "custom",
  "pluginConfig": {
    "invoke_interval": "30s",
    "timeout": "5s",
    "max_output_length": 80,
    "concurrency": 3,
    "enable_message_change_based_condition_update": false
  },
  "source": "ntp-custom-plugin-monitor",
  "metricsReporting": true,
  "conditions": [
    {
      "type": "NTPProblem",
      "reason": "NTPIsUp",
      "message": "ntp service is up"
    }
  ],
  "rules": [
    {
      "type": "temporary",
      "reason": "NTPIsDown",
      "path": "./config/plugin/check_ntp.sh",
      "timeout": "3s"
    },
    {
      "type": "permanent",
      "condition": "NTPProblem",
      "reason": "NTPIsDown",
      "path": "./config/plugin/check_ntp.sh",
      "timeout": "3s"
    }
  ]
}

2. Writing a Custom Plugin Script

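The custom plugin script is the executable that performs the actual health check. NPD decides the result from the script's exit code rather than from its text: 0 means OK, 1 means a problem was detected and the rule fires, and 2 means the state is unknown. Whatever the script prints to standard output becomes the message attached to the resulting event or condition, truncated to max_output_length.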

#!/bin/bash

readonly OK=0
readonly NONOK=1
readonly UNKNOWN=2

readonly SERVICE='ntp.service'

# Check systemd cmd present
if ! command -v systemctl >/dev/null; then
  echo "Could not find 'systemctl' - require systemd"
  exit $UNKNOWN
fi

# Return success if service active (i.e. running)
if systemctl -q is-active "$SERVICE"; then
  echo "$SERVICE is running"
  exit $OK
else
  # Does not differentiate stopped/failed service from non-existent
  echo "$SERVICE is not running"
  exit $NONOK
fi
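
Before wiring a plugin into NPD, it can help to run it by hand and see how its exit code would be read. The harness below is a sketch and not part of NPD: run_plugin is a hypothetical helper that maps exit codes 0 and 1 to the OK and NONOK values the script above defines, treats anything else as UNKNOWN, and trims the message to 80 characters to mirror max_output_length in the example configuration.

```shell
#!/bin/bash
# Run a plugin command and print "<STATUS>: <message>".
# Usage: run_plugin <command> [args...]
run_plugin() {
  message=$("$@" 2>/dev/null)
  code=$?
  case $code in
    0) status=OK ;;
    1) status=NONOK ;;
    *) status=UNKNOWN ;;  # exit code 2 and, in this sketch, any other code
  esac
  # Keep at most 80 characters of output, like max_output_length above.
  printf '%s: %.80s\n' "$status" "$message"
}
```

For example, run_plugin ./config/plugin/check_ntp.sh prints "NONOK: ntp.service is not running" on a host where the service is stopped.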

Conclusion

Node Problem Detector is more than a monitoring tool: it acts as a safety net for your Kubernetes clusters. By surfacing system-level problems that the default node conditions miss, and by letting you define your own checks, NPD helps you catch node issues before they affect workloads.

Embrace NPD and take a proactive approach to node monitoring in your Kubernetes journey!